Post on 23-Dec-2015
Predicting function from sequence
Peer Bork
EMBL & MDC
Heidelberg & Berlin
bork@embl-heidelberg.de
http://www.bork.embl-heidelberg.de/
Bioinformatics
- Generation of information (biophysics)
- Storage and retrieval of information (informatics for biodatabases)
- Translation of information into knowledge (computational biology)
Chance of deducing structural and functional features by homology
Many homologues, an increasing number of predictable folds, but tough times for automatic function prediction
Function prediction from sequence
- Function and domain prediction
- Function prediction by gene context
- Quality and heterogeneity of data
- Prediction accuracy: 70% hurdle
Quality and heterogeneity of data
Challenges despite highly similar sequences due to sequencing errors and other artefacts
Challenges due to low sequence similarity, paralogy and multiple domains
Algorithmic challenges versus data quality and biological diversity
Number of human genes over time
[Figure: estimated number of human genes in thousands (y-axis, 0 to 120) against time (x-axis, Mar 2000 to Apr 2001); series: HGS, Incyte and co.; textbooks, public opinion; Celera; HGP; others]
Nature 304, 16. November 2000
Heterogeneous data from large-scale approaches
- Gene expression (correlation to proteins poor)
- Yeast two-hybrid (8% overlap with each other)
- Many others...
Mycoplasma pneumoniae predictions
[Figure: bar chart, scale 0 to 100; bars: 1995 Function, 1995 Structure, 1999 Function, 1999 Structure; legend: Homology, Twilight, Context, Folds, fold twilight]
Dandekar et al., 2000 NAR Sep
Mycoplasma pneumoniae re-annotation, 1995 vs 1999
- ORFs: +12, -1 = 688
- RNAs: +9 = 42
- ORFs with functions: +105 = 458
- ORFs changed: 16 extended, 8 shortened
- Function changed: 30 more, 18 less specific
57% of all entries were re-annotated!
Function prediction from sequence
- Function and domain prediction
- Function prediction by gene context
- Quality and heterogeneity of data
- Prediction accuracy: 70% hurdle
Quality and heterogeneity of data
Prediction accuracy: 70% hurdle
70% prediction accuracy is great!
Prediction of | acc*cov | % acc | % cov of reference set | reference
Human promoters | .35 | 50% | 70% of annotated test set | Prestridge, 1995; Bucher, pers. comm.
Human regulatory RNA elements | .34 | 85% | 40% of new DNA | Dandekar & Sharma, 1998
Human genes (only presence) | .49 | 70% | 70% of chromosome 22 | Dunham et al., 1999 and refs therein
Human SNPs by EST comparison | .21 | 70% | 30% of all proteins with SNP | Sunyaev et al., 2000; Buetow et al., 1999
Human alternative splicing | .45 | 90% | 50% of all splice sites | Hanke et al., 1999
Transmembranes (only presence) | .85 | 85% | 99% of annotated test set | Tusnady & Simon, 1998 and refs therein
Signal peptides (only presence) | .90 | 90% | 100% of annotated test set | Nielsen et al., 1999
GPI anchors (incl. cleavage site) | .72 | 72% | 100% of annotated test set | Eisenhaber et al., 1999
Coiled coil (only presence) | .81 | 90% | 90% of annotated coiled coil | Lupas, 1996
Secondary structure (3 states) | .77 | 77% | 100% of 3D test set | Jones, 1999 and refs therein
Buried or exposed residues | .74 | 74% | 100% of 3D test set | Rost, 1996
Residue hydration | .72 | 72% | 100% of 3D test set | Ehrlich et al., 1998
Protein folds (in Mycoplasma) | .49 | 98% | 50% of Mycoplasma ORFs | Teichmann et al., 1999 and refs therein
Homology (several methods) | .49 | 98% | 50% of 3D test set | Muller et al., 1999 and refs therein
Functional features by homology | .63 | 90% | 70% unicellular genomes | Bork and Koonin, 98; Brenner, 99
Function association by context | .25 | 50% | 10% 'high confidence' in yeast | Marcotte et al., 1999b
Cellular localization (2 states) | .77 | 77% | 100% of annotated test set | Andrade et al., 1998
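The first numeric column of the table above appears to be the product of accuracy and coverage, both taken as fractions. A minimal Python sketch under that assumption, using three rows copied from the table; the function name `acc_times_cov` is illustrative and not from the slides:

```python
def acc_times_cov(accuracy_pct, coverage_pct):
    """Return accuracy times coverage as a fraction; inputs are percentages."""
    return (accuracy_pct / 100) * (coverage_pct / 100)

# (accuracy %, coverage %) for three rows of the table
rows = {
    "Human promoters": (50, 70),
    "Signal peptides (only presence)": (90, 100),
    "Coiled coil (only presence)": (90, 90),
}

# Round to two decimals, the precision used on the slide
scores = {name: round(acc_times_cov(acc, cov), 2) for name, (acc, cov) in rows.items()}

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Some rows on the slide do not equal this product exactly, so the column is best read as reported rather than recomputed.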
Clear homology via Blast; yet, misleading annotation hampers automatic function prediction
Phylogenetic tree of Blast hits reveals that no function prediction is possible
Molecular functions have to be defined on a domain basis, i.e. separately for each structurally independent unit within a sequence
Henikoff et al. 1997 Science 278, 609
Function prediction from sequence
- Function and domain prediction
- Function prediction by gene context
- Quality and heterogeneity of data
- Prediction accuracy: 70% hurdle
Prediction accuracy: 70% hurdle
Function and domain prediction
Dotplot to reveal residue conservation
[Figure: dotplot of Rsp5 from yeast against Ned4 from human; domain labels on both axes: C2, WW (repeated), HECT; annotations: repeat pattern, conserved domain, domain insertion]
Function prediction for disease genes
Breast cancer gene BRCA1
- Positionally cloned 1994 (Miki et al., Science 266, p. 66)
- Features originally deduced from the 1857 aa sequence: contains a RING finger (30 aa, usually binds diverse proteins)
- Function unknown, even localization unclear
Localization experiments on BRCA1
Title / Journal | Conclusion
A strong candidate for the breast and ovarian cancer susceptibility gene, Science 266, 1994 | ?
Aberrant subcellular localization of BRCA1 in breast cancer, Science 270, Nov. 1995 | Cytoplasmic (nuclear)
Growth retardation and tumor inhibition by BRCA1, Nature Genet. 12, March 1996 | Nuclear ?
BRCA1 is secreted and exhibits properties of a granin, Nature Genet. 12, March 1996 | Extracellular
Location of BRCA1 in breast and ovarian cancer, Science 272, April 1996 | Nuclear and cytoplasmic
Domain discovery in BRCA1
Domain discovery in disease genes
gene/protein | disease | domains | reference
dystrophin | Muscular dystrophy | WW | Bork & Sudol: TIBS 19 (94) 531
X11 | Friedreich's ataxia (c) | PI/PTB+PDZ | Bork & Margolis: Cell 80 (95) 693
PKD1 | Polycystic kidney | many (PKD1) | Int. PKD1 consortium: Cell 81 (95) 298
HD | Huntington's | HEAT repeats | Andrade & Bork: Nat. Genet. 11 (95) 115
BRCA2 | Breast cancer | BRC repeats | Bork et al.: Nat. Genet. 13 (96) 22
BRCA1 | Breast cancer | BRCT | Koonin et al.: Nat. Genet. 13 (96) 266
dsh | DiGeorge syndrome | DEP | Ponting & Bork: TIBS 21 (96) 245
X25 (FRDA) | Friedreich's ataxia | CyaY | Gibson et al.: TINS 19 (96) 465
beige/CH | Chediak-Higashi | BEACH | Nagle et al.: Nat. Genet. 14 (96) 307
RB | Retinoblastoma | BRCT | Bork et al.: FASEB J. 11 (97) 68
9 incl. HML1 | Colon cancer | HSP90 | Mushegian et al.: PNAS 94 (97) 5831
TSG101 | Breast cancer | UBC | Ponting, Cai & Bork: JMM 75 (97) 467
WRN/BLM | Werner + Bloom syn. | HRDC | Morozov et al.: TIBS 22 (97) 417
2 incl. pyrin | Mediterranean fever | SPRY | Schultz et al.: PNAS 95 (98) 5857
p73 | various tumors? | SAM | Bork & Koonin: Nat. Genet. 18 (98) 313
mahogany | Obesity | PSI | Nagle et al.: Nature 398 (99) 148
Parkin | AP-J Parkinsonism | IBR | Morett & Bork: TIBS 24 (99) 229
SMART: Blast-like input
- ID or AC sufficient
- Access to different databases
- Domain annotation
www.smart.embl-heidelberg.de
SMART: Digested output
- signal sequence
- transmembrane regions
- comparison of domain context
Non-globular functional features in protein sequences
- Transmembrane regions
- Signal sequences
- GPI anchors
- Coiled coils
- Other compositionally biased regions (short internal repeats)
SMART: Blast with "in between" regions
- automatically cuts respective region
- cut and paste for other programs
- some specific output features
SMART: Domain annotation
- multiple alignment
- consensus features
- residue annotation
- search options
- description
SMART: Species distribution
- total occurrence
- taxonomic breakdown
- model organisms
- protein and domain statistics
Annotation improvement using domain correlation
Domain architecture of C35B8.2 C. elegans
Query: VAV H. sapiens
Reconstructed structure of C35B8.2
Find closest hit: selective SMART
Evaluate correlation; scan genome region
[Figure: domain architectures; labelled domain: SH3]
Domain organization of TAP
[Figure: domain diagram of TAP (619 aa; scale bar 100 aa); domains: LRR (x4), NTF2-like, UBA; regions: RNA-binding, p15-binding, np-bind.; annotations: directed mutagenesis, random mutagenesis]
Collaboration with Elisa Izaurralde
Directed mutagenesis confirms predicted TAP/p15 interaction
[Figure: NTF2-like domain and p15, with residues coloured as follows]
- Red: loss of binding
- Blue: no effect on binding
- Gray: alanine scan
Human genome reveals whole TAP family
Independent duplications in fly, worm and human
In 90% of the human genome: 6 homologues, but of these 1-2 pseudogenes
[Figure: TAP family; label: TAP]
Sequenced eukaryotic genomes
Bork and Copley, Nature 409 (01) 818
History of signaling domain discovery: Novel nuclear and cytoplasmic domains
[Figure: bar chart, scale 0 to 35; time bins: <1985, 87/88, 91/92, 95/96, 99/20; series: cytoplasmic domains, nuclear domains; annotation: systematic approach by searching 'in between' regions]
Top 10 domains* in human
Species | man | fly | worm | yeast | cress
Immunoglobulin | 765(381) | 140 | 64 | 0 | 1
C2H2 zinc finger | 706(607) | 357 | 151 | 48 | 115
Protein kinase | 575(501) | 319 | 437 | 121 | 1049
Rhod.-like GPCR | 569(616) | 97 | 358 | 0 | 16
P-loop NTPase | 433 | 198 | 183 | 97 | 331
Rev. transcriptase | 350 | 10 | 50 | 6 | 80
RRM (RNA-binding) | 300(224) | 157 | 96 | 54 | 255
WD40 (G-protein) | 277(136) | 162 | 102 | 91 | 210
Ankyrin repeat | 276(145) | 105 | 107 | 19 | 120
Homeobox | 267(160) | 148 | 109 | 9 | 118
Total no. genes | 26500(26500) | 13300 | 18200 | 6100 | 25700
*Only no. of genes given, no. of domains higher; note that only around 90% is sequenced
Nature 409 (01) 860; Science 291 (01) 1304
Top 10 mobile domains in human
Species | man | fly | worm | yeast | cress
C2H2 zinc finger | 5653 | 1778 | 587 | 104 | 255
Immunoglobulin | 1364 | 457 | 530 | 0 | 2
EGF | 1207 | 466 | 539 | 1 | 53
WD40 (G-protein) | 894 | 678 | 488 | 340 | 1022
Ankyrin repeat | 714 | 363 | 344 | 38 | 261
Cadherin domain | 622 | 201 | 113 | 0 | 0
Protein kinases | 586 | 259 | 462 | 122 | 1054
Fibronectin type 3 | 557 | 217 | 212 | 2 | 6
RRM (RNA-binding) | 443 | 242 | 183 | 94 | 460
CCP/sushi/SCR | 277 | 64 | 80 | 0 | 0
Total no. genes | 26500 | 13300 | 18200 | 6100 | 25700
Only no. of domains given, no. of proteins lower; note that only around 90% is sequenced
SMART analysis of 31700 predicted human ORFs
Correlation between domains
[Figure: domain correlation diagram; categories: extra, intra, nuclear, other; marker: PX]
Function prediction from sequence
- Function and domain prediction
- Function prediction by gene context
- Quality and heterogeneity of data
- Prediction accuracy: 70% hurdle
Function and domain prediction
Function prediction by gene context
Phenotypic features do not coincide with species evolution...
yeast
...but gene content does
Orthology vs paralogy ... within homology
[Figure: gene history diagram; a gene gives rise to gene 1 and gene 2; Genome A carries gene A1 and gene A2, Genome B carries gene B1 and gene B2; relationships labelled orthology and paralogy]
Differential Genome Display
[Figure: H. influenzae genome]
Huynen et al., 1997, Trends Genet 13, 389
Exploiting the absence of genes
Huynen et al., 1998, FEBS Lett 426, 1-5
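Comparing gene content between genomes, including which genes are absent, can be pictured as set operations. The following Python sketch is only an illustration of that idea, not the method of the cited papers; the gene-family identifiers (`OG1` and so on) and the function name `differential_display` are hypothetical.

```python
def differential_display(genome_a, genome_b):
    """Split two gene-family sets into shared and genome-specific parts."""
    a, b = set(genome_a), set(genome_b)
    return {
        "only_a": a - b,   # present in A, absent from B
        "only_b": b - a,   # present in B, absent from A
        "shared": a & b,   # present in both genomes
    }

# Hypothetical gene-family content of two genomes (toy data)
genome_a = ["OG1", "OG2", "OG3", "OG5"]
genome_b = ["OG1", "OG3", "OG4"]

result = differential_display(genome_a, genome_b)
for part, families in result.items():
    print(part, sorted(families))
```

In practice the hard part is deciding which genes count as the same family across genomes; the sketch assumes that assignment has already been made.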
Predicting functional interactions between proteins by the co-occurrence of their genes in genomes
Distribution of four M. genitalium genes among 25 genomes
MG299 (pta)  0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1
MG357 (ackA) 0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1
MG019 (dnaJ) 0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1
MG305 (dnaK) 0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1
Using the mutual information between genes as a scoring heuristic for their co-occurrence.
M(pta, ackA) = 0.69 (phosphotransacetylase, acetate kinase)
M(dnaJ, dnaK) = 0.55 (heat shock proteins)
M(dnaJ, ackA) = 0.19
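The scores above can be reproduced from the four presence/absence profiles. The sketch below assumes the standard definition of mutual information with natural logarithms (a choice of base not stated on the slide); the function and variable names are illustrative.

```python
from collections import Counter
from math import log

def mutual_information(x, y):
    """Mutual information (in nats) between two equal-length 0/1 profiles."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    # Sum p(a,b) * ln(p(a,b) / (p(a) * p(b))) over observed value pairs
    return sum(
        (c / n) * log((c * n) / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )

# Profiles copied from the slide: one character per genome, 25 genomes
PROFILES = {
    "pta":  "0001100001101011000101111",
    "ackA": "0001100001101011000101111",
    "dnaJ": "0011111101111011100111111",
    "dnaK": "0011111101111011100111111",
}

for g1, g2 in [("pta", "ackA"), ("dnaJ", "dnaK"), ("dnaJ", "ackA")]:
    score = mutual_information(PROFILES[g1], PROFILES[g2])
    print(f"M({g1}, {g2}) = {score:.2f}")
```

For two identical profiles the score equals the entropy of the profile, which is why the pta/ackA pair (present in 12 of 25 genomes) scores higher than the dnaJ/dnaK pair (present in 19 of 25).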
Phylogenetic distribution of iron-sulfur cluster assembly proteins
[Figure: presence of genes across species. Genes: hscB/Jac1, hscA, ssq1, Nfu1, iscA/Isa1-2, fdx/Yah1, Arh1, ORF1, ORF2, ORF3, iscS/Nfs1, iscU/Isu1-2, cyaY/Yfh1 (frataxin). Species: A. aeolicus, Synechocystis, B. subtilis, M. genitalium, M. tuberculosis, D. radiodurans, R. prowazekii, C. crescentus, M. loti, N. meningitidis, X. fastidiosa, P. aeruginosa, Buchnera, V. cholerae, H. influenzae, P. multocida, E. coli, A. pernix, M. jannaschii, A. thaliana, S. cerevisiae, C. jejuni, C. albicans, S. pombe, H. sapiens, C. elegans, H. pylori, D. melan.]
The phylogenetic distribution of cyaY (frataxin) is identical to that of hscB/Jac1, indicating a functional role of cyaY in iron-sulfur cluster assembly on proteins, specifically in conjunction with Jac1.
Huynen et al., Hum. Mol. Genet. 2001
Function prediction via gene context information
Genomic context information:
- Conserved gene neighborhood in genomes
- Gene fusion as distinct neighborhood subset
- Conserved co-occurrence of genes in species ('phylogenetic profile', 'COG pattern')
- Surrounding and shared regulatory elements
Knowledge-based context information:
- Pathway data (can overrule homology!)
- Gene expression data (co-expression etc.)
- Protein interaction / localisation
- Scientific literature
Evolution of genome organization
Dotplot to reveal gene order conservation
Conservation of gene neighborhood
Pairwise comparison of 20 prokaryotic genomes
[Figure: conservation of gene neighborhood in pairwise genome comparisons; axis labels: (time), (log); marked genome pairs: EC-HI, MG-MP]
Nucleotide salvage/degradation pathway in gram-positive bacteria
TCA cycle in evolution
Huynen et al., 1999, Trends Microb. 7, 281
Varying gene neighborhood within ribosomal operons
Pathway prediction using context information
[Figure: gene neighborhoods in M. pneumoniae, M. tuberculosis, E. coli, T. maritima and B. subtilis; genes shown: exp (exporter), hyp, tlyC (hemolysin), era (GTPase), phoL (PhoH-like); slide text: "is essential part of the hemolysin export"]
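Conserved gene neighborhood can be illustrated by counting how often two genes sit next to each other across genomes. The Python sketch below is a simplified illustration under stated assumptions: the gene orders are toy data that only reuse the gene labels from the slide (they are not the real arrangements in any species), and the function names and the `min_genomes` threshold are hypothetical.

```python
from collections import Counter

def adjacent_pairs(gene_order):
    """Unordered pairs of genes that are direct neighbours in one genome."""
    return {frozenset(pair) for pair in zip(gene_order, gene_order[1:])}

def conserved_neighbours(genomes, min_genomes=3):
    """Gene pairs that are adjacent in at least `min_genomes` genomes."""
    counts = Counter()
    for order in genomes.values():
        counts.update(adjacent_pairs(order))
    return {
        tuple(sorted(pair)): n
        for pair, n in counts.items()
        if n >= min_genomes
    }

# Toy gene orders (hypothetical, for illustration only)
genomes = {
    "genome_1": ["exp", "hyp", "tlyC", "era"],
    "genome_2": ["hyp", "tlyC", "era", "phoL"],
    "genome_3": ["phoL", "hyp", "tlyC", "era"],
    "genome_4": ["exp", "tlyC", "phoL", "era"],
}

conserved = conserved_neighbours(genomes)
for pair, n in sorted(conserved.items()):
    print(pair, n)
```

A pair that stays adjacent in several distantly related genomes is a candidate for a functional link; real analyses also have to account for how closely related the compared genomes are.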
STRING server for context retrieval
Tryptophan biosynthesis
www.bork.embl-heidelberg.de/STRING
Snel et al. NAR 28 (00) 3442
Homology vs context methods: M. genitalium as benchmark
- MG total: 480 genes
- Homology-based function: 368 genes
- Context-based function: 238 genes
[Figure: diagram comparing the gene sets; labels: hypothetical, additional information; counts as extracted: 333328, 26]
Martijn (NL)
Frank* (D) Yan (C) Peer (D)
Tobias (D)
Luis* (E)
Jörg* (D) Berend (NL) Warren (US)
Miguel (E)
Shamil (RU)
Birgit* (D) Mikita (J)
Richard (UK)
Vassily (RU), Ina* (D)
Gert* (D)
+Thomas* (D), David (E), Ivica (Hr), Carolina (E), Steffen (D), Francesca (I), Jan (D)
*left EMBL