Exploiting Basic Evolutionary Principles for the Quality Control of Gene Predictions László Patthy Institute of Enzymology, Budapest Darwin Day Collegium Budapest 2009. February 9.


Exploiting Basic Evolutionary Principles for the Quality Control of Gene Predictions

László Patthy

Institute of Enzymology, Budapest

Darwin Day

Collegium Budapest

2009. February 9.


In the last decade the genomes of numerous organisms have been sequenced; however, converting raw genome sequence data into biological knowledge remains a difficult task.

Genome annotation, the process that maps biological knowledge onto the relevant genome elements, requires defining the positions of all protein-coding (and non-coding) genes along the genome sequence and identifying their coding regions, regulatory sequences, promoters, etc.


Although a large number of programs have been developed for computational gene identification, correct prediction of the structure of all protein-coding genes of higher eukaryotes is still an elusive goal.

The uncertainties associated with gene finding may be illustrated by the fact that - eight years after the publication of the draft genome sequence (2001) - the exact number of protein-coding genes in the human genome is still unknown.


Finishing the euchromatic sequence of the human genome. Nature. 2004 Oct 21;431(7011):931-45.


Proc Natl Acad Sci U S A. 2007 Dec 4;104(49):19428-33.


Since direct evidence of protein existence is generally absent, the criterion often employed to annotate a transcript as protein-coding is the existence of an Open Reading Frame (ORF). However, this criterion has been recently questioned by a number of methods developed to assess the quality of protein-coding gene annotations.


The rationale of the method of Clamp et al. is that functional protein-coding genes are subject to purifying selection and are therefore expected to show evolutionary conservation. The authors used two measures to assess the evolutionary conservation of predicted human genes: reading frame conservation (RFC, based on the observation that indels do not significantly affect the size of functional proteins) and codon substitution frequency (CSF, based on the observation that the pattern of nucleotide substitution in functional protein-coding genes differs from that observed in random DNA).

Clamp M, Fry B, Kamal M, Xie X, Cuff J, Lin MF, Kellis M, Lindblad-Toh K, Lander ES. Distinguishing protein-coding and noncoding genes in the human genome. Proc Natl Acad Sci U S A. 2007 Dec 4;104(49):19428-33.
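The reading frame conservation idea can be illustrated with a deliberately simplified sketch. It is not the scoring procedure of Clamp et al.; it assumes a pairwise nucleotide alignment given as two gapped strings, and the function name `reading_frame_conservation` is hypothetical. The score is the fraction of human nucleotides that are aligned in the most common relative reading frame.

```python
from collections import Counter


def reading_frame_conservation(human_aln: str, other_aln: str) -> float:
    """Simplified RFC: fraction of human nucleotides aligned in the
    most common relative reading frame (assumption: '-' marks a gap)."""
    if len(human_aln) != len(other_aln):
        raise ValueError("aligned sequences must have equal length")

    i = j = 0            # ungapped positions in the human / other sequence
    human_len = 0        # number of human nucleotides in the alignment
    frames = Counter()   # relative frame (0, 1 or 2) -> aligned nucleotides

    for h, o in zip(human_aln, other_aln):
        if h != "-" and o != "-":
            frames[(i - j) % 3] += 1
        if h != "-":
            human_len += 1
            i += 1
        if o != "-":
            j += 1

    if human_len == 0:
        return 0.0
    return max(frames.values(), default=0) / human_len


if __name__ == "__main__":
    human = "ATGAAACCCGGG"
    # A 3-nt deletion keeps the downstream frame; a 1-nt deletion shifts it.
    print(reading_frame_conservation(human, "ATG---CCCGGG"))
    print(reading_frame_conservation(human, "ATGAAA-CCGGG"))
```

In this toy measure an in-frame indel only removes the deleted positions from the numerator, whereas a frame-shifting indel splits the sequence into segments in different frames, so the score drops more sharply.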


In their analysis of a number of human gene reference sets, Clamp et al. identified ~1200 human “orphans”: ORFs that lack homology to known genes. Both RFC and CSF analyses revealed that many of these human orphans behave in a way that is essentially indistinguishable from matched random controls and very different from that observed in nonorphan protein-coding genes. From this, the authors concluded that overall about 15% of the entries in the gene catalogues investigated are not valid protein-coding genes.

Clamp M, Fry B, Kamal M, Xie X, Cuff J, Lin MF, Kellis M, Lindblad-Toh K, Lander ES. Distinguishing protein-coding and noncoding genes in the human genome. Proc Natl Acad Sci U S A. 2007 Dec 4;104(49):19428-33.


While the quality control method of Clamp et al. can distinguish protein-coding genes from noncoding sequences, it is less suited to identifying gene predictions that are only partially correct. If an annotated gene lacks one or more exons, or part of an exon, it may still exhibit the evolutionary characteristics expected of protein-coding genes.


Indeed, in addition to the uncertainty about the number of protein-coding genes, a very serious problem is that the structure of a significant proportion of human genes is incorrectly predicted. According to recent analyses, the predicted genomic structure is correct for only about half of the predicted human genes.

Obviously, erroneous prediction of the structure of protein-coding genes leads to serious problems in predicting the structure and function of the proteins they encode and hinders the identification of the elements that regulate their expression.


A recent study has systematically compared the performance of various computational methods to predict human protein-coding genes.

A set of well-annotated ENCODE sequences was analyzed blind with the different gene-finding programs, and the predictions obtained were compared with the annotations.

Predictions were analyzed at the nucleotide, exon, transcript and gene levels to evaluate how well the predictions reproduce the annotation.

Guigó R, Flicek P, Abril JF, Reymond A, Lagarde J, Denoeud F, Antonarakis S, Ashburner M, Bajic VB, Birney E, Castelo R, Eyras E, Ucla C, Gingeras TR, Harrow J, Hubbard T, Lewis SE, Reese MG. EGASP: the human ENCODE Genome Annotation Assessment Project. Genome Biol. 2006;7 Suppl 1:S2.1-31. Epub 2006 Aug 7. Review.


The computational methods compared were classified as

1) EST-, mRNA-, and protein-based methods (AUGUSTUS-EST, PAIRAGON+NSCAN_EST, ACEVIEW, ENSEMBL, EXOGEAN, EXONHUNTER, ACEMBLY, ECGene, MGCGene)

2) single-genome ab initio methods (AUGUSTUSabinit, GENEMARKhmm, GENEZILLA, GENEID, GENSCAN)

3) dual- or multiple-genome based comparative genomic methods (AUGUSTUS-dual, ACESCAN, DOGFISH-C, NSCAN, SAGA, MARS, SGP2, TWINSCAN)

4) complex methods using any type of available information (AUGUSTUSany, FGENESH++, JIGSAW, PAIRAGONany, CCDSGene, KNOWNGene, REFSEQ)


At all levels, two basic measures were computed:

- sensitivity: the proportion of annotated features (nucleotide, exon, gene) that have been correctly predicted

- specificity: the proportion of predicted features that is correct.

The average sensitivity and specificity ((Sn + Sp)/2) was also calculated for each program.
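As a minimal sketch of these two measures at the exon level, the snippet below treats each exon as a `(start, end)` coordinate pair and counts a predicted exon as correct only if both boundaries match an annotated exon. The function name and the coordinates are hypothetical and serve only as illustration.

```python
from typing import Set, Tuple

Exon = Tuple[int, int]  # (start, end) coordinates of one exon


def sensitivity_specificity(annotated: Set[Exon],
                            predicted: Set[Exon]) -> Tuple[float, float, float]:
    """Return (Sn, Sp, (Sn + Sp) / 2) for exact exon matches."""
    correct = annotated & predicted
    # Sn: proportion of annotated exons that were correctly predicted
    sn = len(correct) / len(annotated) if annotated else 0.0
    # Sp: proportion of predicted exons that are correct
    sp = len(correct) / len(predicted) if predicted else 0.0
    return sn, sp, (sn + sp) / 2


if __name__ == "__main__":
    annotated = {(100, 200), (300, 400), (500, 600), (700, 800)}
    predicted = {(100, 200), (300, 400), (520, 600)}  # one wrong, one missing
    sn, sp, avg = sensitivity_specificity(annotated, predicted)
    print(f"Sn={sn:.2f} Sp={sp:.2f} (Sn+Sp)/2={avg:.2f}")
```

The same function applies at the nucleotide or gene level if the sets hold nucleotide positions or gene identifiers instead of exon coordinates.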


Guigo et al., Genome Biol. 2006;7 Suppl 1:S2.1-31.

Figure: gene feature projection for evaluation of the accuracy of predictions. The PREDICTION track is compared with the KNOWN annotation track; missing exons and wrong exons are marked.


Guigo et al., Genome Biol. 2006;7 Suppl 1:S2.1-31.

Gene prediction accuracy at the transcript level. Boxplots of the average sensitivity and specificity ((Sn + Sp)/2) for each program.

A transcript is accurately predicted if the beginning and end of translation are correctly annotated and all 5' and 3' splice sites of the coding exons are correct.
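The transcript-level criterion can be sketched as a comparison of boundary sets. The code below is an illustration under stated assumptions: coding exons are `(start, end)` pairs on the forward strand, each transcript has at least one coding exon, and the helper names are hypothetical.

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # (start, end) of one coding exon, forward strand


def coding_boundaries(cds: List[Exon]) -> Dict[str, object]:
    """Collect translation start/end and the splice sites of a transcript."""
    exons = sorted(cds)
    return {
        "translation_start": exons[0][0],
        "translation_end": exons[-1][1],
        # 5' splice sites (donors) follow every coding exon except the last
        "donor_sites": {end for _, end in exons[:-1]},
        # 3' splice sites (acceptors) precede every coding exon except the first
        "acceptor_sites": {start for start, _ in exons[1:]},
    }


def transcript_is_accurate(annotated: List[Exon], predicted: List[Exon]) -> bool:
    """True only if every coding boundary of the prediction matches."""
    return coding_boundaries(annotated) == coding_boundaries(predicted)


if __name__ == "__main__":
    annotated = [(100, 200), (300, 400), (500, 600)]
    print(transcript_is_accurate(annotated, [(100, 200), (300, 400), (500, 600)]))
    print(transcript_is_accurate(annotated, [(100, 200), (500, 600)]))  # exon missed
```

A single shifted splice site or a skipped exon is enough to make the prediction fail, which is why this is the most stringent of the evaluation levels.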


Guigo et al., Genome Biol. 2006;7 Suppl 1:S2.1-31.

These studies have revealed that

i) none of the strategies produced perfect predictions

ii) prediction methods that rely on mRNA and protein sequences and those that use combined information (including expressed sequence information) were generally the most accurate

iii) the dual- or multiple-genome methods were more accurate than the single-genome ab initio prediction methods

iv) at the transcript level (the most stringent criterion), no prediction method correctly identified more than 45% of the coding transcripts.


The MisPred project

The implicit question is: are there signs that could indicate that the predicted structure of a protein-coding gene may be incorrect?

The rationale of our MisPred project is that a protein-coding gene is suspected to be mispredicted if some of its features conflict with our current knowledge about protein-coding genes and proteins.

Nagy A, Hegyi H, Farkas K, Tordai H, Kozma E, Bányai L, Patthy L. Identification and correction of abnormal, incomplete and mispredicted proteins in public databases. BMC Bioinformatics. 2008 Aug 27;9:353.

MisPred: Database of mispredicted and abnormal proteins; http://mispred.enzim.hu/


Several quality control tools of MisPred address the question of whether the predicted protein can reach the cellular compartment where it is properly folded, stable and functional.

The rationale of these tools is that protein domains have adapted to different subcellular compartments during evolution and they are usually misfolded, unstable and non-functional if mislocalized.

Nagy A, Hegyi H, Farkas K, Tordai H, Kozma E, Bányai L, Patthy L. Identification and correction of abnormal, incomplete and mispredicted proteins in public databases. BMC Bioinformatics. 2008 Aug 27;9:353.


Figure: domain co-occurrence network of metazoan multidomain proteins. Legend: extracellular domain, cytoplasmic signalling domain, nuclear domain.

Tordai H, Nagy A, Farkas K, Bányai L, Patthy L. Modules, multidomain proteins and organismic complexity. FEBS J. 2005 Oct;272(19):5064-78.

As a corollary, in multidomain proteins domain-types do not co-occur at random:

- in extracellular proteins domains adapted to the extracellular milieu are used

- extracellular and intracellular domains can co-occur only in transmembrane proteins

- nuclear and extracellular domains do not co-occur in a single protein etc.


Some mislocalization-based MisPred tools used for the identification of abnormal or mispredicted proteins

- Conflict between the presence of extracellular domains and the absence of the appropriate sequence signals.

- Conflict between the presence of extracellular and intracellular signaling domains and the absence of transmembrane domains.

- Co-occurrence of extracellular and nuclear domains.


Rationale: proteins containing domains that occur exclusively in the extracellular space (e.g. in secreted extracellular proteins or in the extracellular part of type I, type II, type III single pass transmembrane proteins or in multispanning transmembrane proteins) have a cleavable signal peptide at the N-terminal end and/or transmembrane segments.

Accordingly, proteins that contain extracellular domains but lack both a signal peptide and transmembrane segments are considered abnormal.
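The rule can be expressed as a simple check. This is only a sketch and not the MisPred implementation: the `PredictedProtein` record, the small set of extracellular domain names and the example values are assumptions made for illustration, and the signal peptide and transmembrane annotations are taken as given rather than predicted here.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical, illustrative list of domain families treated as
# exclusively extracellular; a real list would come from a curated resource.
EXTRACELLULAR_DOMAINS = {"Kunitz_BPTI", "Sushi", "Cadherin"}


@dataclass
class PredictedProtein:
    name: str
    domains: List[str]        # domain family names found in the sequence
    has_signal_peptide: bool  # result of a signal peptide predictor
    tm_segments: int          # number of predicted transmembrane segments


def lacks_secretory_signal(protein: PredictedProtein) -> bool:
    """Flag proteins with extracellular domains but no signal peptide
    and no transmembrane segment."""
    has_extracellular = any(d in EXTRACELLULAR_DOMAINS for d in protein.domains)
    return (has_extracellular
            and not protein.has_signal_peptide
            and protein.tm_segments == 0)


if __name__ == "__main__":
    truncated = PredictedProtein("example_truncated", ["Sushi", "Sushi"], False, 0)
    complete = PredictedProtein("example_complete", ["Sushi", "Sushi"], True, 0)
    print(lacks_secretory_signal(truncated), lacks_secretory_signal(complete))
```

A flagged entry is a candidate for closer inspection, for example a prediction that has lost its N-terminal exon, not a proof of misprediction.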

Figure: Conflict between the presence of extracellular domains and the absence of the appropriate sequence signals. Example proteins with the signal peptide (SP) and transmembrane (TM) labels recovered from the slide: latrophilin-2 (SP); complement factor MASP-3 (SP); leukocyte activation antigen M6 (SP, TM, TM); receptor tyrosine kinase-like orphan receptor 2 (SP, TM); killer cell lectin-like receptor (TM).


enst00000359637.1.pep
UNI_TREMBL:Q8N708

ID Q8N708 PRELIMINARY; PRT; 449 AA.
AC Q8N708;
DT 01-OCT-2002 (TrEMBLrel. 22, Created)
DT 01-OCT-2002 (TrEMBLrel. 22, Last sequence update)
DT 01-MAR-2003 (TrEMBLrel. 23, Last annotation update)
DE HF1 protein.
. . .

SCORES Init1: 3167 Initn: 3167 Opt: 3167 z-score: 3657.9 E(): 1.1e-195
>>UNI_TREMBL:Q8N708 (449 aa)
initn: 3167 init1: 3167 opt: 3167 Z-score: 3657.9 expect(): 1.1e-195
Smith-Waterman score: 3167; 99.5% identity in 430 aa overlap (1-430:20-449)

10 20 30 40 enst00000359 DCNELPPRRNTEILTGSWSDQTYPEGTQAIYKCRPGYRSLG |||||||||||||||||||||||||||||||||||||||||Q8N708 MRLLAKIICLMLWAICVAEDCNELPPRRNTEILTGSWSDQTYPEGTQAIYKCRPGYRSLG 10 20 30 40 50 60

50 60 70 80 90 100 enst00000359 NVIMVCRKGEWVALNPLRKCQKRPCGHPGDTPFGTFTLTGGNVFEYGVKAVYTCNEGYQL ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||Q8N708 NVIMVCRKGEWVALNPLRKCQKRPCGHPGDTPFGTFTLTGGNVFEYGVKAVYTCNEGYQL 70 80 90 100 110 120

110 120 130 140 150 160 enst00000359 LGEINYRECDTDGWTNDIPICEVVKCLPVTAPENGKIVSSAMEPDREYHFGQAVRFVCNS ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||Q8N708 LGEINYRECDTDGWTNDIPICEVVKCLPVTAPENGKIVSSAMEPDREYHFGQAVRFVCNS 130 140 150 160 170 180

170 180 190 200 210 220 enst00000359 GYKIEGDEEMHCSDDGFWSKEKPKCVEISCKSPDVINGSPISQKIIYKENERFQYKCNMG ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||Q8N708 GYKIEGDEEMHCSDDGFWSKEKPKCVEISCKSPDVINGSPISQKIIYKENERFQYKCNMG 190 200 210 220 230 240

230 240 250 260 270 280 enst00000359 YEYSERGDAVCTESGWRPLPSCEEKSCDNPYIPNGDYSPLRIKHRTGDEITYQCRNGFYP ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||Q8N708 YEYSERGDAVCTESGWRPLPSCEEKSCDNPYIPNGDYSPLRIKHRTGDEITYQCRNGFYP 250 260 270 280 290 300

290 300 310 320 330 340 enst00000359 ATRGNTAKCTSTGWIPAPRCTLKPCDYPDIKHGGLYHENMRRPYFPVAVGKYYSYYCDEH ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||Q8N708 ATRGNTAKCTSTGWIPAPRCTLKPCDYPDIKHGGLYHENMRRPYFPVAVGKYYSYYCDEH 310 320 330 340 350 360

350 360 370 380 390 400 enst00000359 FETPSGSYWDHIHCTQDGWSPAVPCLRKCYFPYLENGYNQNHGRKFVQGKSIDVACHPGY |||||||||||||||||||||||||||||||||||||||||:||||||||||||||||||Q8N708 FETPSGSYWDHIHCTQDGWSPAVPCLRKCYFPYLENGYNQNYGRKFVQGKSIDVACHPGY 370 380 390 400 410 420

410 420 430enst00000359 ALPKAQTTVTCMENGWSPTPRCIRVKFTL |||||||||||||||||||||||||:|||Q8N708 ALPKAQTTVTCMENGWSPTPRCIRVSFTL 430 440

Complement factor H, isoform b: Q8N708 (with signal peptide, SP) is CORRECT; ENSP00000352658.1 is MISPREDICTED.


enst00000256876.3.pep
UNI_SPROT:IL2A_HUMAN

ID IL2A_HUMAN STANDARD; PRT; 272 AA.
AC P01589;
DT 21-JUL-1986 (Rel. 01, Created)
DT 21-JUL-1986 (Rel. 01, Last sequence update)
DT 01-OCT-2004 (Rel. 45, Last annotation update)
DE Interleukin-2 receptor alpha chain precursor (IL-2 receptor alpha . . .

SCORES Init1: 637 Initn: 719 Opt: 637 z-score: 759.2 E(): 3.2e-34
>>UNI_SPROT:IL2A_HUMAN (272 aa)
initn: 719 init1: 637 opt: 637 Z-score: 759.2 expect(): 3.2e-34
Smith-Waterman score: 637; 100.0% identity in 92 aa overlap (1-92:31-122)

10 20 30enst00000256 IPHATFKAMAYKEGTMLNCECKRGFRRIKS ||||||||||||||||||||||||||||||IL2A_HUMAN MDSYLLMWGLLTFIMVPGCQAELCDDDPPEIPHATFKAMAYKEGTMLNCECKRGFRRIKS 10 20 30 40 50 60

40 50 60 70 80 90enst00000256 GSLYMLCTGNSSHSSWDNQCQCTSSATRNTTKQVTPQPEEQKERKTTEMQSPMQPVDQAS ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||IL2A_HUMAN GSLYMLCTGNSSHSSWDNQCQCTSSATRNTTKQVTPQPEEQKERKTTEMQSPMQPVDQAS 70 80 90 100 110 120

100 enst00000256 LPDFQIQTEMAATMETSI || IL2A_HUMAN LPGHCREPPPWENEATERIYHFVVGQMVYYQCVQGYRALHRGPAESVCKMTHGKTRWTQP 130 140 150 160 170 180

Interleukin-2 receptor alpha chain precursor: P01589 (with signal peptide, SP, and transmembrane segment, TM) is CORRECT; ENSP00000256876.3 is MISPREDICTED.


enst00000248228.2.pep
UNI_SPROT:209L_HUMAN

ID 209L_HUMAN STANDARD; PRT; 399 AA.
AC Q9H2X3; Q969M4; Q96QP3; Q96QP4; Q96QP5; Q96QP6; Q9BXS3; Q9H2Q9;
AC Q9H8F0; Q9Y2A8;
DT 05-JUL-2004 (Rel. 44, Created)
DT 05-JUL-2004 (Rel. 44, Last sequence update)
DT 01-OCT-2004 (Rel. 45, Last annotation update)
. . .

SCORES Init1: 1721 Initn: 1997 Opt: 1732 z-score: 1784.9 E(): 2.4e-91
>>UNI_SPROT:209L_HUMAN (399 aa)
initn: 1997 init1: 1721 opt: 1732 Z-score: 1784.9 expect(): 2.4e-91
Smith-Waterman score: 2034; 82.2% identity in 399 aa overlap (1-332:1-399)

10 20 30 40 enst00000248 MSDSKEPRVQQLGLLEEDPTTSGIRLFPRDFQFQQIHGHKSST------------VPFLL ||||||||||||||||||||||||||||||||||||||||||| : |:|209L_HUMAN MSDSKEPRVQQLGLLEEDPTTSGIRLFPRDFQFQQIHGHKSSTGCLGHGALVLQLLSFML 10 20 30 40 50 60

50 60 70 80 90 enst00000248 --G-------PVSKVPSSLSQEQSEQDAIYQNLTQLKAAVGELSEKSKLQEIYQELTQLK | |||||||||||||||||||||||||||||||||||||||||||||||||209L_HUMAN LAGVLVAILVQVSKVPSSLSQEQSEQDAIYQNLTQLKAAVGELSEKSKLQEIYQELTQLK 70 80 90 100 110 120

100 110 120 130 140 150 enst00000248 AAVGELPEKSKLQEIYQELTRLKAAVGELPEKSKLQEIYQELTRLKAAVGELPEKSKLQE ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||209L_HUMAN AAVGELPEKSKLQEIYQELTRLKAAVGELPEKSKLQEIYQELTRLKAAVGELPEKSKLQE 130 140 150 160 170 180

160 170 180 190 enst00000248 IYQELTRLKAAVGELPEKSKLQEIYQELTELKAAV------------------------- ||||||||||||||||||||||||||||||||||| 209L_HUMAN IYQELTRLKAAVGELPEKSKLQEIYQELTELKAAVGELPEKSKLQEIYQELTQLKAAVGE 190 200 210 220 230 240

200 210 220 230 enst00000248 ---------------------ERLCRHCPKDWTFFQGNCYFMSNSQRNWHDSVTACQEVR |||||||||||||||||||||||||||||||||||||||209L_HUMAN LPDQSKQQQIYQELTDLKTAFERLCRHCPKDWTFFQGNCYFMSNSQRNWHDSVTACQEVR 250 260 270 280 290 300

240 250 260 270 280 290 enst00000248 AQLVVIKTAEEQNFLQLQTSRSNRFSWMGLSDLNQEGTWQWVDGSPLSPSFQRYWNSGEP ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||209L_HUMAN AQLVVIKTAEEQNFLQLQTSRSNRFSWMGLSDLNQEGTWQWVDGSPLSPSFQRYWNSGEP 310 320 330 340 350 360

300 310 320 330 enst00000248 NNSGNEDCAEFSGSGWNDNRCDVDNYWICKKPAACFRDE |||||||||||||||||||||||||||||||||||||||209L_HUMAN NNSGNEDCAEFSGSGWNDNRCDVDNYWICKKPAACFRDE 370 380 390

CD209 antigen-like protein 1: Q9H2X3 (with transmembrane segment, TM) is CORRECT; ENSP00000248228.2 is MISPREDICTED.


CADH2_HUMAN [906 residues]

AC110015.1_002 [191 residues]

1 50cadh2_human MCRIAGALRT LLPLLLALLQ ASVEASGEIA LCKTGFPEDV YSAVLSKDVH ac110015 ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~

51 100cadh2_human EGQPLLN... VKFSNCNGKR KVQYESSEPA DFKVDEDGMV YAVRSFPLSS ac110015 ~~~~MCNTQR MKFSNCNGKR KVQYESSEPA DFKVDEDGMV YAVRSFPLSS

101 150cadh2_human EHAKFLIYAQ DKETQEKWQV AVKLSLKPTL TEESVKESAE VEEIVFPRQF ac110015 EHAKFLIYAQ DKETQEKWQV AVKLSLKPTL TEESVKESAE VEEIVFPRQF

151 200cadh2_human SKHSGHLQRQ KRDWVIPPIN LPENSRGPFP QELVRIRSDR DKNLSLRYSV ac110015 SKHSGHLQRQ KRDWVIPPIN LPENSRGPFP QELVRIRSDR DKNLSLRYSV

201 250cadh2_human TGPGADQPPT GIFIINPISG QLSVTKPLDR EQIARFHLRA HAVDINGNQV ac110015 TGPGADQPPT GIFIINPISG QLSVTKPLDR EQIARFHLRA HAVDI~~~~~

251 300cadh2_human ENPIDIVINV IDMNDNRPEF LHQVWNGTVP EGSKPGTYVM TVTAIDADDP ac110015 ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~

301 350cadh2_human NALNGMLRYR IVSQAPSTPS PNMFTINNET GDIITVAAGL DREKVQQYTL ac110015 ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~

351 400cadh2_human IIQATDMEGN PTYGLSNTAT AVITVTDVND NPPEFTAMTF YGEVPENRVD ac110015 ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~

401 450cadh2_human IIVANLTVTD KDQPHTPAWN AVYRISGGDP TGRFAIQTDP NSNDGLVTVV ac110015 ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~

451 500cadh2_human KPIDFETNRM FVLTVAAENQ VPLAKGIQHP PQSTATVSVT VIDVNENPYF ac110015 ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~

501 550cadh2_human APNPKIIRQE EGLHAGTMLT TFTAQDPDRY MQQNIRYTKL SDPANWLKID ac110015 ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~

551 600cadh2_human PVNGQITTIA VLDRESPNVK NNIYNATFLA SDNGIPPMSG TGTLQIYLLD ac110015 ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~

601 650cadh2_human INDNAPQVLP QEAETCETPD PNSINITALD YDIDPNAGPF AFDLPLSPVT ac110015 ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~

651 700cadh2_human IKRNWTITRL NGDFAQLNLK IKFLEAGIYE VPIIITDSGN PPKSNISILR ac110015 ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~

701 750cadh2_human VKVCQCDSNG DCTDVDRIVG AGLGTGAIIA ILLCIIILLI LVLMFVVWMK ac110015 ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~

751 800cadh2_human RRDKERQAKQ LLIDPEDDVR DNILKYDEEG GGEEDQDYDL SQLQQPDTVE ac110015 ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~

801 850cadh2_human PDAIKPVGIR RMDERPIHAE PQYPVRSAAP HPGDIGDFIN EGLKAADNDP ac110015 ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~

851 900cadh2_human TAPPYDSLLV FDYEGSGSTA GSLSSLNSSS SGGEQDYDYL NDWGPRFKKL ac110015 ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~

901 cadh2_human ADMYGGGDD ac110015 ~~~~~~~~~

Cadherin-2 precursor: CADH2_HUMAN (with signal peptide, SP, and transmembrane segment, TM) is CORRECT; AC110015.1_002 is ABNORMAL.


Conflict between the presence of extracellular and intracellular signalling domains and the absence of transmembrane domains.

Rationale: extracellular domains and intracellular signalling domains can co-occur in multidomain proteins only if transmembrane segments separate these two types of domains.

Figure: domain co-occurrence network of metazoan multidomain proteins. Legend: extracellular module, cytoplasmic signalling module, nuclear module. Tordai et al., FEBS J. 2005; 272(19):5064-78.

Accordingly, proteins that contain extracellular and intracellular signalling domains but lack a transmembrane segment separating them are considered abnormal.
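A positional version of this rule can be sketched as follows. The domain classification sets, the coordinates and the function name are hypothetical; the sketch only assumes that domains and transmembrane segments are available as residue ranges.

```python
from typing import List, Set, Tuple

Domain = Tuple[str, int, int]  # (family name, start, end) in residue numbers
Segment = Tuple[int, int]      # (start, end) of a transmembrane segment


def missing_separating_tm(domains: List[Domain],
                          tm_segments: List[Segment],
                          extracellular: Set[str],
                          cytoplasmic: Set[str]) -> bool:
    """True if some extracellular / cytoplasmic signalling domain pair
    has no transmembrane segment lying between the two domains."""
    ec = [(s, e) for name, s, e in domains if name in extracellular]
    cyt = [(s, e) for name, s, e in domains if name in cytoplasmic]
    for es, ee in ec:
        for cs, ce in cyt:
            # Order the pair along the sequence to find the linker region
            left_end, right_start = (ee, cs) if ee < cs else (ce, es)
            separated = any(left_end < ts and te < right_start
                            for ts, te in tm_segments)
            if not separated:
                return True
    return False


if __name__ == "__main__":
    # Hypothetical receptor-like architecture: extracellular fn3 domain,
    # cytoplasmic tyrosine kinase domain.
    domains = [("fn3", 330, 420), ("Pkinase_Tyr", 630, 890)]
    print(missing_separating_tm(domains, [(555, 577)], {"fn3"}, {"Pkinase_Tyr"}))
    print(missing_separating_tm(domains, [], {"fn3"}, {"Pkinase_Tyr"}))
```

Checking the position of the transmembrane segment, and not just its presence, also catches predictions in which a transmembrane segment exists but does not lie between the two domain types.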

Example: receptor tyrosine kinase-like orphan receptor 2 (figure labels recovered from the slide: SP, TM, KR).


ENSXETP00000040601 (Xenopus tropicalis) is erroneous since it lacks a transmembrane segment although it contains both extracellular and cytoplasmic signaling domains.


Query 181 SDVTYRVVCKRCSWEQGECIPCANTIGYVPQQSGLVDTYISIVDLVAHANYTFEVEAVNG 240 +DVTYR++CKRCSWEQGEC+PC + IGY+PQQ+GLVD Y++++DL+AHANYTFEVEAVNGSbjct 361 NDVTYRILCKRCSWEQGECVPCGSNIGYMPQQTGLVDNYVTVMDLLAHANYTFEVEAVNG 420

Query 241 VSDLSRSQRLFAAVSVTTGQAAPSQVSGVMKERVLQRAVDLSWQEPEHPNGVITEYEIKY 300 VSDLSRSQRLFAAVS+TTGQAAPSQVSGVMKERVLQR+V+LSWQEPEHPNGVITEYEIKYSbjct 421 VSDLSRSQRLFAAVSITTGQAAPSQVSGVMKERVLQRSVELSWQEPEHPNGVITEYEIKY 480

Query 301 YEKDQRERTYSTLKTKSTSVSINNLRPGTAYIFQIRAFTAAGYGMYSPRLDVSTLEEATV 360 YEKDQRERTYST+KTKSTS SINNL+PGT Y+FQIRAFTAAGYG YSPRLDV+TLEEAT Sbjct 481 YEKDQRERTYSTVKTKSTSASINNLKPGTVYVFQIRAFTAAGYGNYSPRLDVATLEEATA 540

Query 361 YYIFA-CSYCIAYIMGSQSSLLLCLQIALQLLINSSSLYYTAALCDLNYNKSLKMHFPSG 419 + + + I + + + ++ + + +I Y+ A D ++ L HF Sbjct 541 TAVSSEQNPVIIIAVVAVAGTIILVFMVFGFIIGRRHCGYSKA--DQEGDEELYFHF--- 595

Query 420 LVKFPGTKTYIDPETYEDPNRAVHQFAKELDASCIKIERVIGAGEFGEVCSGRLKLPGKR 479 KFPGTKTYIDPETYEDPNRAVHQFAKELDASCIKIERVIGAGEFGEVCSGRLKLPGKRSbjct 596 --KFPGTKTYIDPETYEDPNRAVHQFAKELDASCIKIERVIGAGEFGEVCSGRLKLPGKR 653

Query 480 DVPVAIKTLKVGYTEKQRRDFLCEASIMGQFDHPNVVHLEGVVTRGKPVMIVIEFMENGA 539 DV VAIKTLKVGYTEKQRRDFLCEASIMGQFDHPNVVHLEGVVTRGKPVMIVIE+MENGASbjct 654 DVAVAIKTLKVGYTEKQRRDFLCEASIMGQFDHPNVVHLEGVVTRGKPVMIVIEYMENGA 713

Query 540 LDAFLRKLDGQFTVIQLVGMLRGIAAGMRYLADMGYVHRDLAARNILVNSNLVCKVSDFG 599 LDAFLRK DGQFTVIQLVGMLRGIAAGMRYLADMGYVHRDLAARNILVNSNLVCKVSDFGSbjct 714 LDAFLRKHDGQFTVIQLVGMLRGIAAGMRYLADMGYVHRDLAARNILVNSNLVCKVSDFG 773

Query 600 LSRIIEDDPDAVYTTTQGGKIPVRWTAPEAIQYRKFTSASDVWSYGIVMWEVMSYGERPY 659 LSR+IEDDP+AVYTTT GGKIPVRWTAPEAIQYRKFTSASDVWSYGIVMWEVMSYGERPYSbjct 774 LSRVIEDDPEAVYTTT-GGKIPVRWTAPEAIQYRKFTSASDVWSYGIVMWEVMSYGERPY 832

Query 660 WDMSNQDVIKAIEEGYRLPAPMDCPAGLHQLMLDCWQKERGERPKFEQIVGILDKMIRNP 719 WDMSNQDVIKAIEEGYRLPAPMDCPAGLHQLMLDCWQKERGERPKFEQIVGILDKMIRNPSbjct 833 WDMSNQDVIKAIEEGYRLPAPMDCPAGLHQLMLDCWQKERGERPKFEQIVGILDKMIRNP 892

Query 720 NSLKTPMGTCNRPTSPLLDQNTLDFNSFCSVGEWLEAIKMERYKENFSSSGYNSLESVAR 779 NSLKTP+GTC+RP SPLLDQNT DF +FCSVGEWL+AIKMERYK+NF+++GYNSLESVARSbjct 893 NSLKTPLGTCSRPISPLLDQNTPDFTTFCSVGEWLQAIKMERYKDNFTAAGYNSLESVAR 952

Query 780 MSIDDVISLGITLVGHQKKIMNSIQTMRAQMLQLHGTGI 818 M+I+DV+SLGITLVGHQKKIM+SIQTMRAQML LHGTGISbjct 953 MTIEDVMSLGITLVGHQKKIMSSIQTMRAQMLHLHGTGI 991

The chicken ortholog of ENSXETP00000040601 (Xenopus tropicalis), EPHA7_CHICK Ephrin type-A receptor 7 (NP_990414), does contain a transmembrane segment.

ENSXETP00000040601 (Xenopus tropicalis) deviates most significantly from EPHA7_CHICK in this region.


The erroneous part of ENSXETP00000040601 (Xenopus tropicalis) could be corrected by identifying the exons encoding the ‘missing’ transmembrane segment.

451 500

ensxetp00000040601_corrected KERVLQRAVD LSWQEPEHPN GVITEYEIKY YEKDQRERTY STLKTKSTSV

np_990414 KERVLQRSVE LSWQEPEHPN GVITEYEIKY YEKDQRERTY STVKTKSTSA

ensxetp00000040601 KERVLQRAVD LSWQEPEHPN GVITEYEIKY YEKDQRERTY STLKTKSTSV

501 550

ensxetp00000040601_corrected SINNLRPGTA YIFQIRAFTA AGYGMYSPRL DVSTLEEATA TAVSTEQNPV

np_990414 SINNLKPGTV YVFQIRAFTA AGYGNYSPRL DVATLEEATA TAVSSEQNPV

ensxetp00000040601 SINNLRPGTA YIFQIRAFTA AGYGMYSPRL DVSTLEEATV YYIFACSYCI

551 600

ensxetp00000040601_corrected IIIAVVAVAG TIILVFMVFG FIIGRRHCGY SKA..DQEGD EELYFHC...

np_990414 IIIAVVAVAG TIILVFMVFG FIIGRRHCGY SKA..DQEGD EELYFHF...

ensxetp00000040601 AYI.MGSQSS LLLCLQIALQ LLINSSSLYY TAALCDLNYN KSLKMHFPSG

601 650

ensxetp00000040601_corrected ......TKTY IDPETYEDPN RAVHQFAKEL DASCIKIERV IGAGEFGEVC

np_990414 ..KFPGTKTY IDPETYEDPN RAVHQFAKEL DASCIKIERV IGAGEFGEVC

ensxetp00000040601 LVKFPGTKTY IDPETYEDPN RAVHQFAKEL DASCIKIERV IGAGEFGEVC

651 700

ensxetp00000040601_corrected SGRLKLPGKR DVPVAIKTLK VGYTEKQRRD FLCEASIMGQ FDHPNVVHLE

np_990414 SGRLKLPGKR DVAVAIKTLK VGYTEKQRRD FLCEASIMGQ FDHPNVVHLE

ensxetp00000040601 SGRLKLPGKR DVPVAIKTLK VGYTEKQRRD FLCEASIMGQ FDHPNVVHLE


Co-occurrence of extracellular and nuclear domains.

Rationale: nuclear domains do not co-occur with extracellular domains in multidomain proteins. Accordingly, proteins that contain both extracellular and nuclear domains are considered abnormal.
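The co-occurrence rule amounts to a set intersection test. The sketch below assumes that each domain family has already been classified as nuclear or extracellular; the two small classification sets and the function name are hypothetical placeholders.

```python
from typing import List, Set, Tuple

# Hypothetical, illustrative classification of domain families
NUCLEAR_DOMAINS = {"Homeobox"}
EXTRACELLULAR_DOMAINS = {"Kunitz_BPTI"}


def nuclear_extracellular_conflicts(domains: List[str],
                                    nuclear: Set[str] = NUCLEAR_DOMAINS,
                                    extracellular: Set[str] = EXTRACELLULAR_DOMAINS
                                    ) -> List[Tuple[str, str]]:
    """Return every (nuclear, extracellular) domain pair found in one protein."""
    found = set(domains)
    return sorted((n, e)
                  for n in found & nuclear
                  for e in found & extracellular)


if __name__ == "__main__":
    # A predicted protein carrying both domain types is flagged
    print(nuclear_extracellular_conflicts(["Homeobox", "Kunitz_BPTI"]))
    # A protein with only one of the two types is not
    print(nuclear_extracellular_conflicts(["Homeobox"]))
```

Returning the offending pairs, not just a yes/no answer, makes it easier to see which part of a predicted gene may belong to a neighbouring gene.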

Figure: domain co-occurrence network of metazoan multidomain proteins. Legend: extracellular module, cytoplasmic signalling module, nuclear module. Tordai et al., FEBS J. 2005; 272(19):5064-78.


YL15_CAEEL

1 50q619j1_caebr ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ yl15_caeel MTSKTNMTSN KFAYDFFPWS NDTNSSQQIK NIKPPPKRSN RPTKRTTFTS

51 100q619j1_caebr ~~~~~~~~~~ ~~~~~~~~~~ MFVWSAAVLI FSSVVPTFAQ YGCI....SE yl15_caeel EQVTLLELEF AKNEYICKDR RGELAQTIEL TECQVKTWFQ NRRTKKRSSE

101 150q619j1_caebr LTFGKACPQN KTSTKWFFDA KLSFCYPYQF LGCDEGSNSF ESSDICLESC yl15_caeel LKFGTACSEN KTSTKWYYDS KLLFCYPYKY LGCGEGSNSF ESNENCLESC

151 200q619j1_caebr KPADQFSCGG NTDADGICFS PSDSGCKKGT DCVMGGNIGF CCNKATQDEW yl15_caeel KPADQFSCGG NTGPDGVCFA HGDQGCKKGT VCVMGGMVGF CCDKKIQDEW

201 250q619j1_caebr NKEHSPTCSK GSVVQFKQWF GMTPLIGRNC AHKFCPAGST CIQGKWTAHC yl15_caeel NKENSPKCLK GQVVQFKQWF GMTPLIGRSC SHNFCPEKST CVQGKWTAYC

251 q619j1_caebr CQ yl15_caeel CQ

Hypothetical homeobox protein C02F12.5 in chromosome X (domains: Homeobox, Kunitz_BPTI): MISPREDICTED.

Mispredicted protein (Swiss-Prot entry) containing nuclear and extracellular domains.

Exons belonging to tandem genes on C. elegans chromosome X have been incorrectly joined.


1 50yl15_caeel_corr MTSKTNMTSN KFAYDFFPWS NDTNSSQQIK NIKPPPKRSN RPTK.RTTFT hm07_caeel ~~~~~~~~~~ ~~~~~~~~MK HEMVFTFLLM MVRPEASTSR IPRR.RTTFT q7qbz2_anoga ~~~~~~~~~~ ~~~~MLFTTS YSRNKPTNNS NVARRRKKEG RPRRQRTTFS

51 100yl15_caeel_corr SEQVTLLELE FAKNEYICKD RRGELAQTIE LTECQVKTWF QNRRTKKRSF hm07_caeel VEQLYLLEMY FAQSQYVGCD ERERLARILS LDEYQVKIWF QNRRIRMRRE q7qbz2_anoga SEQTLRLEVE FHRNEYISRG RRFELAEVLK LSETQIKIWF QNRRAKDKRI

101 115yl15_caeel_corr I*~~~~~~~~ ~~~~~ hm07_caeel ANK~~~~~~~ ~~~~~ q7qbz2_anoga EKAQIDQQYR SVRIK

1 50 q619j1_caebr MFVWSAAVLI FSSVVPTFAQ YGCISELTFG KACPQNKTST KWFFDAKLSF yl15_caeel_corr1 MLFFTLLIQL F..LVPVLCQ YACSSELKFG TACSENKTST KWYYDSKLLF

51 100 q619j1_caebr CYPYQFLGCD EGSNSFESSD ICLESCKPAD QFSCGGNTDA DGICFSPSDS yl15_caeel_corr1 CYPYKYLGCG EGSNSFESNE NCLESCKPAD QFSCGGNTGP DGVCFAHGDQ

101 150 q619j1_caebr GCKKGTDCVM GGNIGFCCNK ATQDEWNKEH SPTCSKGSVV QFKQWFGMTP yl15_caeel_corr1 GCKKGTVCVM GGMVGFCCDK KIQDEWNKEN SPKCLKGQVV QFKQWFGMTP

151 178 q619j1_caebr LIGRNCAHKF CPAGSTCIQG KWTAHCCQyl15_caeel_corr1 LIGRSCSHNF CPEKSTCVQG KWTAYCCQ

Corrected predictions for the distinct constituent proteins containing the nuclear Homeobox domain (CORRECT) and the extracellular Kunitz_BPTI domain (CORRECT).


Another MisPred tool detects errors in gene prediction based on ‘Domain size deviation’.

The rationale of this tool is that the highly cooperative, rapid folding of protein domains is the result of natural selection; therefore, the insertion of larger segments into protein domains, or their deletion from them, may yield macromolecules that are unable to rapidly adopt a correctly folded, viable and stable three-dimensional structure.

Accordingly, proteins containing domains that consist of a significantly larger or smaller number of residues than closely related members of the same family may be suspected of being unable to fold efficiently into a viable and stable domain or protein.
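The idea of a domain size deviation can be sketched as a comparison of one domain length against the lengths seen in closely related family members. The sketch below is illustrative only: the z-score test, the `max_z` cut-off and the family lengths are assumptions chosen for this example, not the criteria or data used by MisPred.

```python
from statistics import mean, stdev


def domain_size_deviates(length, family_lengths, max_z=3.0):
    """Flag a domain whose length is far from that of related family members.

    `max_z` is an illustrative cut-off, not a value taken from MisPred.
    """
    mu = mean(family_lengths)
    sigma = stdev(family_lengths)
    if sigma == 0:
        # All reference domains have the same length: any difference deviates.
        return length != mu
    return abs(length - mu) / sigma > max_z


# Made-up lengths (in residues) for closely related members of one family.
family = [598, 601, 600, 603, 599]
print(domain_size_deviates(600, family))  # a typical length
print(domain_size_deviates(518, family))  # a domain with a large deletion
```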


RP11-247A12.5-001 [544 aa]

CACP_HUMAN, Carnitine O-acetyltransferase [626 residues]

1 100
cacp_human MLAFAARTVV KPLGFLKPFS LMKASSRFKA HQDALPRLPV PPLQQSLDHY LKALQPIVSE EEWAHTKQLV DEFQASGGVG ERLQKGLERR ARKTENWLSE
rp11-247a12 MLAFAARTVV KPLGFLKPFS LMKASSRFKA HQDALPRLPV PPLQQSLDHY LKALQPIVSE EEWAHTKQLV DEFQASGGVG ERLQKGLERR ARKTENWLSE

101 200
cacp_human WWLKTAYLQY RQPVVIYSSP GVMLPKQDFV DLQGQLRFAA KLIEGVLDFK VMIDNETLPV EYLGGKPLCM NQYYQILSSC RVPGPKQDTV SNFSKTKKPP
rp11-247a12 WWLKTAYLQY RQPVVIYSSP GVMLPKQDFV DLQGQLRFAA KLIEGVLDFK VMIDNETLPV EYLGGKPLCM NQYYQILSSC RVPGPKQDTV SNFSKTKKPP

201 300
cacp_human THITVVHNYQ FFELDVYHSD GTPLTADQIF VQLEKIWNSS LQTNKEPVGI LTSNHRNSWA KAYNTLIKDK VNRDSVRSIQ KSIFTVCLDA TMPRVSEDVY
rp11-247a12 THITVVHNYQ FFELDVYHSD GTPLTADQIF VQLEKIWNSS LQTNKEPVGI LTSNHRNSWA KAYNTLIKDK VNRDSVRSIQ .......... ..........

301 400
cacp_human RSHVAGQMLH GGGSRLNSGN RWFDKTLQFI VAEDGSCGLV YEHAAAEGFP IVTLLDYVIE YTKKPELVRS PMVPLPMPKK LRFNITPEIK SDIEKAKQNL
rp11-247a12 .......... .......... .......... .......... .......... .......... ..KKPELVRS PLVPLPMPKK LRFNITPEIK SDIEKAKQNL

401 500
cacp_human SIMIQDLDIT VMVFHHFGKD FPKSEKLSPD AFIQMALQLA YYRIYGQACA TYESASLRMF HLGRTDTIRS ASMDSLTFVK AMDDSSVTEH QKVELLRKAV
rp11-247a12 SIMIQDLDIT VMVFHHFGKD FPKSEKLSPD AFIQMALQLA YYRIYGQACA TYESASLRMF HLGRTDTIRS ASMDSLTFVK AMDDSSVTEH QKVELLRKAV

501 600
cacp_human QAHRGYTDRA IRGEAFDRHL LGLKLQAIED LVSMPDIFMD TSYAIAMHFH LSTSQVPAKT DCVMFFGPVV PDGYGVCYNP MEAHINFSLS AYNSCAETNA
rp11-247a12 QAHRGYTDRA IRGEAFDRHL LGLKLQAIED LVSMPDIFMD TSYAIAMHFH LSTSQVPAKT DCVMFFGPVV PDGYGVCYNP MEAHINFSLS AYNSCAETNA

601 626
cacp_human ARLAHYLEKA LLDMRALLQS HPRAKL
rp11-247a12 ARLAHYLEKA LLDMRALLQS HPRAKL

CORRECT

ABNORMAL

RP11-247A12.5-001 encodes an internally deleted Carn_acyltransf domain

Region missing from RP11-247A12.5-001
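The missing region can be located by scanning the aligned rp11-247a12 row for runs of gap characters. The sketch below is a minimal illustration; the function name is an assumption, and the input row uses placeholder residues (`X`) that only reproduce the gap layout of the alignment above (280 aligned residues, 82 gap columns, 264 aligned residues).

```python
def gap_runs(aligned, gap_chars=".~-"):
    """Return (start, end) alignment columns (1-based, inclusive) of each gap run."""
    runs = []
    start = None
    for col, ch in enumerate(aligned, start=1):
        if ch in gap_chars:
            if start is None:
                start = col
        elif start is not None:
            runs.append((start, col - 1))
            start = None
    if start is not None:
        runs.append((start, len(aligned)))
    return runs


# Placeholder row with the gap layout of rp11-247a12 in the alignment above.
row = "X" * 280 + "." * 82 + "X" * 264
runs = gap_runs(row)
print(runs)

# Keep only internal deletions, i.e. runs that touch neither terminus.
internal = [(s, e) for s, e in runs if s > 1 and e < len(row)]
print(internal)
```

Because cacp_human has no gaps in this alignment, alignment columns coincide with its residue numbers, so the gap run spans residues 281 to 362 of CACP_HUMAN. That accounts for the difference between 626 and 544 residues and includes position 343.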


STRUCTURE OF HUMAN CARNITINE ACETYLTRANSFERASE (1NM8.pdb)

Three-dimensional structure of human carnitine O-acetyltransferase (1NM8.pdb); His 343 is labelled.

The region highlighted in yellow is missing from transcript RP11-247A12.5-001. This region also contains the catalytic residue His-343.


1 50

epha5_human MRGSGPRGAG HRRPP..SGG GDTPITPASL AGCYSAPRRA PLWTCLLLCA

epha5_rat MRGSGPRGAG RRRTQGRGGG GDTPRVPASL AGCYSAPLKG PLWTCLLLCA

epha5_chick M...GLRGGG .....GRAGG ......PA.. .......... PGWTCLLLCA

epha5_mouse MRGSGPRGAG HRRTQGRGGG DDTPRVPASL AGCYSAPLKG PLWTCLLLCA

51 100

epha5_human ALRTLLASPS NEVNLLDSRT VMGDLGWIAF PKNGWEEIGE VDENYAPIHT

epha5_rat ALRTLLASPS NEVNLLDSRT VLGDLGWIAF PKNGWEEIGE VDENYAPIHT

epha5_chick ALRSLLASPG SEVNLLDSRT VMGDLGWIAY PKNGWEEIGE VDENYAPIHT

epha5_mouse ALRTLLASPS NEVNLLDSRT VMGDLGWIAF PKNGWEEIGE VDENYAPIHT

101 150

epha5_human YQVCKVMEQN QNNWLLTSWI SNEGASRIFI ELKFTLRDCN SLPGGLGTCK

epha5_rat YQVCKVMEQN QNNWLLTSWI SNEGASRIFI ELKFTLRDCN SLPGGLGTCK

epha5_chick YQVCKVMEQN QNNWLLTSWI SNEGRPASSF ELKFTLRDCN SLPGGLGTCK

epha5_mouse YQVCKVMEQN QNNWLLTSWI SNEGASRIFI ELKFTLRDCN SLPGGLGTCK

151 200

epha5_human ETFNMYYFES DDQNGRNIKE NQYIKIDTIA ADESFTELDL GDRVMKLNTE

epha5_rat ETFNMYYFES DDENGRNIKD NQYIKIDTIA ADESFTELDL GDRVMKLNTE

epha5_chick ETFNMYYFES DDEDGRNIRE NQYIKIDTIA ADESFTELDL GDRVMKLNTE

epha5_mouse ETFNMYYFES DDENGRSIKE NQYIKIDTIA ADESFTELDL GDRVMKLNTE

201 250

epha5_human VRDVGPLSKK GFYLAFQDVG ACIALVSVRV YYKKCPSVVR HLAVFPDTIT

epha5_rat VRDVGPLSKK GFYLAFQDVG ACIALVSVRV YYKKCPSVVR HLAVFPDTIT

epha5_chick VRDVGPLTKK GFYLAFQDVG ACIALVSVRV YYKKCPSVIR NLARFPDTIT

epha5_mouse VRDVGPLSKK GFYLAFQDVG ACIALVSVRV YYKKCPSVVR HLAIFPDTIT

251 300

epha5_human GADSSQLLEV SGSCVNHSVT DEPPKMHCSA EGEWLVPIGK CMCKAGYEEK

epha5_rat GADSSQLLEV SGSCVNHSVT DDPPKMHCSA EGEWLVPIGK CMCKAGYEEK

epha5_chick GADSSQLLEV SGVCVNHSVT DEAPKMHCSA EGEWLVPIGK CLCKAGYEEK

epha5_mouse GADSSQLLEV SGSCVNHSVT DDPPKMHCSA EGEWLVPIGK CMCKAGYEEK

301 350

epha5_human NGTCQVCRPG FFKASPHIQS CGKCPPHSYT HEEASTSCVC EKDYFRRESD

epha5_rat NGTCQVCRPG FFKASPHSQT CSKCPPHSYT HEEASTSCVC EKDYFRRESD

epha5_chick NNTCQVCRPG FFKASPHSPS CSKCPPHSYT LDEASTSCLC EEHYFRRESD

epha5_mouse NGTCQ..... .......... .......... .......... ..........

351 400

epha5_human PPTMACTRPP SAPRNAISNV NETSVFLEWI PPADTGGRKD VSYYIACKKC

epha5_rat PPTMACTRPP SAPRNAISNV NETSVFLEWI PPADTGGGKD VSYYILCKKC

epha5_chick PPTMACTRPP SAPRSAISNV NETSVFLEWI PPADTGGRKD VSYYIACKKC

epha5_mouse .......... .......... .......... .......... ..........

401 450

epha5_human NSHAGVCEEC GGHVRYLPRQ SGLKNTSVMM VDLLAHTNYT FEIEAVNGVS

epha5_rat NSHAGVCEEC GGHVRYLPQQ IGLKNTSVMM ADPLAHTNYT FEIEAVNGVS

epha5_chick NSHSGLCEAC GSHVRYLPQQ TGLKNTSVMM VDLLAHTNYT FEIEAVNGVS

epha5_mouse .......... .......... .......... .......... ..........

451 500

epha5_human DLSPGARQYV SVNVTTNQAA PSPVTNVKKG KIAKNSISLS WQEPDRPNGI

epha5_rat DLSPGTRQYV SVNVTTNQAA PSPVTNVKKG KIAKNSISLS WQEPDRPNGI

epha5_chick DQNPGARQFV SVNVTTNQAA PSPVSSVKKG KITKNSISLS WQEPDRPNGI

epha5_mouse .......... .........A PSPVTNVKKG KIAKNSISLS WQEPDRPNGI

501 550

epha5_human ILEYEIKHFE KDQETSYTII KSKETTITAE GLKPASVYVF QIRARTAAGY

epha5_rat ILEYEIKYFE KDQETSYTII KSKETTITAE GLKPASVYVF QIRARTAAGY

epha5_chick ILEYEIKYFE KDQETSYTII KSKETAITAD GLKPGSAYVF QIRARTAAGY

epha5_mouse ILEYEIKYFE KDQETSYTII KSKETSITAE GLKPASVYVF QIRARTAAGY

551 600

epha5_human GVFSRRFEFE TTPV.FAASS DQSQIPVIAV SVTVGVILLA VVIGVLLSGS

epha5_rat GVFSRRFEFE TTPV.FGASN DQSQIPIIGV SVTVGVILLA VMIGFLLSGS

epha5_chick GGFSRRFEFE TSPV.LAASS DQSQIPIIVV SVTVGVILLA VVIGFLLSGS

epha5_mouse GVFSRRFEFE TTPVSVAASN DQSQIPIIAV SVTVGVILLA VMIGFLLSGS

601 650

epha5_human CCECGCGRAS SLCAVAHPIL IWRCGYSKAK QDPEEEKMHF HNGHIKLPGV

epha5_rat CCECGCGRAS SLCAVAHPSL IWRCGYSKAK QDPEEEKMHF HNGHIKLPGV

epha5_chick CCDHGCGWAS SLRAVAYPSL IWRCGYSKAK QDPEEEKMHF HNGHIKLPGV

epha5_mouse CCDCGCGRAS SLCAVAHPSL IWRCGYSKAK QDPEEEKMHF HNGHIKLPGV

651 700

epha5_human RTYIDPHTYE DPNQAVHEFA KEIEASCITI ERVIGAGEFG EVCSGRLKLP

epha5_rat RTYIDPHTYE DPTQAVHEFG KEIEASCITI ERVIGAGEFG EVCSGRLKLP

epha5_chick RTYIDPHTYE DPNQAVHEFA KEIEASCITI ERVIGAGEFG EVCSGRLKLQ

epha5_mouse RTYIDPHTYE DPNQAVHEFA KEIEASCITI ERVIGAGEFG EVCSGCLKLP

701 750

epha5_human GKRELPVAIK TLKVGYTEKQ RRDFLGEASI MGQFDHPNII HLEGVVTKSK

epha5_rat GKRELPVATK TLKVGYTEKQ RRDFLSEASI MGQFDHPNII HLEGVVTKSK

epha5_chick GKREFPVAIK TLKVGYTEKQ RRDFLGEASI MGQFDHPNII HLEGVVTKSK

epha5_mouse GKRELPVAIK TLKVGYTEKQ RRDFLGEASI MGQFDHPNII HLEGVVTKSK

751 800

epha5_human PVMIVTEYME NGSLDTFLKK NDGQFTVIQL VGMLRGISAG MKYLSDMGYV

epha5_rat PVMIVTEYME NGSLDTFLKK NDGQFTVIQL VGMLRGIAAG MKYLSDMGYV

epha5_chick PVMIVTEYME NGSLDTFLKK NDGQFTVIQL VGMLRGIASG MKYLSDMGYV

epha5_mouse PVMIVTEYME NGSLDTFLKK NDGQFTVIQL VGMLRGIAAG MKYLSDMGYV

801 850

epha5_human HRDLAARNIL INSNLVCKVS DFGLSRVLED DPEAAYTTRG GKIPIRWTAP

epha5_rat HRDLAARNIL INSNLVCKVS DFGLSRVLED DPEAAYTTRG GKIPIRWTAP

epha5_chick HRDLAARNIL INSNLVCKVS DFGLSRVLED DPEAAYTTRG GKIPIRWTAP

epha5_mouse HRDLAARNIL INSNLVCKVS DFGLSRVLED DPEAAYTTRG GKIPIRWTAP

851 900

epha5_human EAIAFRKFTS ASDVWSYGIV MWEVVSYGER PYWEMTNQDV IKAVEEGYRL

epha5_rat EAIAFRKFTS ASDVWSYGIV MWEVVSYGER PYWEMTNQDV IKAVEEGYRL

epha5_chick EAIAFRKFTS ASDVWSYGIV MWEVMSYGER PYWEMTNQDV IKAVEEGYRL

epha5_mouse EAIAFRKFTS ASDVWSYGIV MWEVVSYGER PYWEMTNQDV IKAVEEGYRL

901 950

epha5_human PSPMDCPAAL YQLMLDCWQK ERNSRPKFDE IVNMLDKLIR NPSSLKTLVN

epha5_rat PSPMDCPAAL YQLMLDCWQK DRNSRPKFDD IVNMLDKLIR NPSSLKTLVN

epha5_chick PSPMDCPAAL YQLMLDCWQK DRNSRPKFDE IVSMLDKLIR NPSSLKTLVN

epha5_mouse PSPMDCPAAL YQLMLDCWQK DRNSRPKFDE IVNMLDKLIR NPSSLKTLVN

951 1000

epha5_human ASCRVSNLLA EHSPLGSGAY RSVGEWLEAI KMGRYTEIFM ENGYSSMDAV

epha5_rat ASSRVSTLLA EHGSLGSGAY RSVGEWLEAT KMGRYTEIFM ENGYSSMDAV

epha5_chick ASSRVSNLLV EHSPVGSGAY RSVGEWLEAI KMGRYTEIFM ENGYSSMDSV

epha5_mouse ASSRVSTLLA EHGSLGSGAY RSVGEWLEAI KMGRYTEIFM ENGYSSMDAV

1001 1041

epha5_human AQVTLEDLRR LGVTLVGHQ. KKIMNSLQEM KVQLVNGMVP L

epha5_rat AQVTLE.... .......... .......... .......... .

epha5_chick AQVTLEDLRR LGVTLVGHQ. KKIMNSLQEM KVQLVNGMVP L

epha5_mouse AQVTLEDLRR LGVTLVGHQK KKIMSSLQEM KVQMVNGMVP V

EPHA5_RAT ephrin type-a receptor 5 precursor [1005 residues]

EPHA5_HUMAN ephrin type-a receptor 5 precursor   [1037 residues]

EPHA5_RAT contains a C-terminally truncated SAM_1 domain, although Swiss-Prot does not annotate the entry as a fragment. Notably, the orthologs from mouse, human and chicken contain an intact SAM_1 domain.
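A C-terminal truncation of this kind shows up as a run of gap characters at the end of an aligned row. The following sketch counts those trailing gap columns for three rows copied from the final alignment block (columns 1001-1041) above; the function name is an assumption made for this example, and the check is a simplification rather than the MisPred procedure.

```python
def trailing_gap_columns(row, gap_chars=".~-"):
    """Count gap characters at the C-terminal end of one aligned row."""
    residues = row.replace(" ", "")
    return len(residues) - len(residues.rstrip(gap_chars))


# Rows copied from the final alignment block (columns 1001-1041) above.
last_block = {
    "epha5_human": "AQVTLEDLRR LGVTLVGHQ. KKIMNSLQEM KVQLVNGMVP L",
    "epha5_rat": "AQVTLE.... .......... .......... .......... .",
    "epha5_chick": "AQVTLEDLRR LGVTLVGHQ. KKIMNSLQEM KVQLVNGMVP L",
}
for name, row in last_block.items():
    print(name, trailing_gap_columns(row))
```

Only trailing gaps are counted, so the single internal gap in the human and chicken rows does not trigger the check.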


851 900

epha5_rat_corrected EAIAFRKFTS ASDVWSYGIV MWEVVSYGER PYWEMTNQDV IKAVEEGYRL

epha5_rat EAIAFRKFTS ASDVWSYGIV MWEVVSYGER PYWEMTNQDV IKAVEEGYRL

epha5_human EAIAFRKFTS ASDVWSYGIV MWEVVSYGER PYWEMTNQDV IKAVEEGYRL

epha5_chick EAIAFRKFTS ASDVWSYGIV MWEVMSYGER PYWEMTNQDV IKAVEEGYRL

epha5_mouse EAIAFRKFTS ASDVWSYGIV MWEVVSYGER PYWEMTNQDV IKAVEEGYRL

901 950

epha5_rat_corrected PSPMDCPAAL YQLMLDCWQK DRNSRPKFDD IVNMLDKLIR NPSSLKTLVN

epha5_rat PSPMDCPAAL YQLMLDCWQK DRNSRPKFDD IVNMLDKLIR NPSSLKTLVN

epha5_human PSPMDCPAAL YQLMLDCWQK ERNSRPKFDE IVNMLDKLIR NPSSLKTLVN

epha5_chick PSPMDCPAAL YQLMLDCWQK DRNSRPKFDE IVSMLDKLIR NPSSLKTLVN

epha5_mouse PSPMDCPAAL YQLMLDCWQK DRNSRPKFDE IVNMLDKLIR NPSSLKTLVN

951 1000

epha5_rat_corrected ASSRVSTLLA EHGSLGSGAY RSVGEWLEAT KMGRYTEIFM ENGYSSMDAV

epha5_rat ASSRVSTLLA EHGSLGSGAY RSVGEWLEAT KMGRYTEIFM ENGYSSMDAV

epha5_human ASCRVSNLLA EHSPLGSGAY RSVGEWLEAI KMGRYTEIFM ENGYSSMDAV

epha5_chick ASSRVSNLLV EHSPVGSGAY RSVGEWLEAI KMGRYTEIFM ENGYSSMDSV

epha5_mouse ASSRVSTLLA EHGSLGSGAY RSVGEWLEAI KMGRYTEIFM ENGYSSMDAV

1001 1042

epha5_rat_corrected AQVTLEDLRR LGVTLVGHQ. KKIMNSLQEM KVQLVNGMVP V*

epha5_rat AQVTLE~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~

epha5_human AQVTLEDLRR LGVTLVGHQ. KKIMNSLQEM KVQLVNGMVP L~

epha5_chick AQVTLEDLRR LGVTLVGHQ. KKIMNSLQEM KVQLVNGMVP L~

epha5_mouse AQVTLEDLRR LGVTLVGHQK KKIMSSLQEM KVQMVNGMVP V~

corrected


Conclusions from MisPred analyses of various databases

The number of UniProtKB/Swiss-Prot entries identified by MisPred as erroneous is very low, attesting to both the high quality of this manually curated database and the reliability of the MisPred approach.

In the case of UniProtKB/TrEMBL, MisPred identified a large proportion of entries as erroneous; the majority of these were missing signal peptides or suffered from domain size deviation. This is due primarily to the fact that these TrEMBL entries are translated in silico from non-full-length cDNAs.

In the case of the EnsEMBL- and NCBI/GNOMON-predicted sequences, MisPred identified ~3-4% of human sequences as erroneous. Here too, the majority of errors were identified on the basis of missing signal peptides and domain size deviation, probably reflecting the influence of non-full-length or abnormal cDNAs on gene predictions.

Nagy A, Hegyi H, Farkas K, Tordai H, Kozma E, Bányai L, Patthy L. Identification and correction of abnormal, incomplete and mispredicted proteins in public databases. BMC Bioinformatics. 2008 Aug 27;9:353.


Conclusions from MisPred analyses of various databases

Application of the MisPred tools to GENCODE peptides revealed that many of the potential alternative gene products encode proteins that are likely to be mislocalized and/or misfolded, suggesting that they do not have a role as functional proteins.

Tress ML, Martelli PL, Frankish A, Reeves GA, Wesselink JJ, Yeats C, Olason PI, Albrecht M, Hegyi H, Giorgetti A, Raimondo D, Lagarde J, Laskowski RA, López G, Sadowski MI, Watson JD, Fariselli P, Rossi I, Nagy A, Kai W, Størling Z, Orsini M, Assenov Y, Blankenburg H, Huthmacher C, Ramírez F, Schlicker A, Denoeud F, Jones P, Kerrien S, Orchard S, Antonarakis SE, Reymond A, Birney E, Brunak S, Casadio R, Guigo R, Harrow J, Hermjakob H, Jones DT, Lengauer T, Orengo CA, Patthy L, Thornton JM, Tramontano A, Valencia A. The implications of alternative splicing in the ENCODE protein complement. Proc Natl Acad Sci U S A. 2007 Mar 27;104(13):5495-500. Epub 2007 Mar 19.


Tress et al., Proc Natl Acad Sci U S A. 2007 Mar 27;104(13):5495-500.


Although large scale whole genome analyses have shown that mammalian transcriptomes are made of a swarming mass of different overlapping transcripts, little evidence exists that the majority of this transcript complexity leads to protein complexity.

The average of 5.7 transcripts per coding locus annotated in GENCODE translates into only 1.7 proteins per locus, since a large fraction of transcript variation corresponds to non-coding transcripts or accumulates in the UTRs of coding transcripts. Moreover, if the GENCODE proteins flagged as problematic by protein assessment methods such as MisPred are ignored, there are barely 1.3 annotated proteins per locus.

The discrepancy between a complex, variable and largely unexplored population of RNA molecules, and a relatively small, stable, and well defined population of proteins, constitutes one of the challenges that Molecular Biology needs to address to fully elucidate cellular function.

Harrow J, Nagy A, Reymond A, Alioto T, Patthy L, Antonarakis SE and Guigó R. Identifying protein-coding genes in genomic sequences. Genome Biology 2009, 10:201


László Bányai, Krisztina Farkas, Hédi Hegyi, Evelin Kozma, Alinda Nagy, Hedvig Tordai

This work was carried out as part of the BioSapiens project. The BioSapiens project is funded by the European Commission within its FP6 Programme, under the thematic area "Life sciences, genomics and biotechnology for health", contract number LHSG-CT-2003-503265.

The authors acknowledge partial support from the National Office for Research and Technology under grant no. eScience RET14/2005.
