Elements of Bioinformatics (14F001) TP2: Gene prediction 22 October 2012 CORRECTIONS


Elements of Bioinformatics (14F001)

TP2: Gene prediction

22 October 2012

CORRECTIONS


Notice:

During this practical, you will need to use ‘raw’ and ‘fasta’ sequence formats.

For additional information on the different sequence formats available, please have a look at http://www.genomatix.de/online_help/help/sequence_formats.html
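As an illustration of the two formats, here is a minimal Python sketch that converts between them. It assumes a single-record FASTA input (one header line starting with ">" followed by sequence lines); the helper names `fasta_to_raw` and `raw_to_fasta` and the toy record are hypothetical and not part of the practical.

```python
def fasta_to_raw(text):
    """Return the bare sequence of a single-record FASTA string."""
    lines = [ln.strip() for ln in text.splitlines()]
    # Drop the '>' header line and blank lines, then remove internal whitespace.
    body = [ln for ln in lines if ln and not ln.startswith(">")]
    return "".join("".join(ln.split()) for ln in body)


def raw_to_fasta(seq, name, width=60):
    """Wrap a raw sequence into FASTA with a header line and fixed-width lines."""
    chunks = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(chunks) + "\n"


# Toy record, invented for this example.
example = ">toy_sequence hypothetical record\nACGTAC GTAC\nGGTT\n"
raw = fasta_to_raw(example)
print(raw)
print(raw_to_fasta(raw, "toy_sequence", width=5))
```

Most web tools used in this practical accept one of these two forms, so a round trip like this is usually enough to move a sequence from one tool to the next.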


ncRNA gene prediction


Choose the eukaryotic tRNA model; the general tRNA model does not give any result!


CpG island prediction


CpG island in the C. elegans cosmid

Length: 219 bp; positions 21954 to 22172

cgttttctgtggtcaca cacgagtatc cggatcttct ggatcaactt gttctcgtct gcaacgtctt tgcaagaatg gcaccagaac agaaacaact actcgtggaa caccttcaag acgttgggca gacggtcgct atgtgtggcg atggagctaa tgattgtgct gctctgaaag cagctcacgc gggaatctca ctatcggagg ctgaagcatc ga

To confirm that this sequence could be part of a promoter (more than 80% of CpG islands extend into the 5’ flanking region of the associated gene), check from its position whether this CpG island lies in a gene promoter region (see later).
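The statistics behind a CpG island call can be sketched in a few lines of Python. The practical does not say which thresholds the web tool applied; the commonly quoted Gardiner-Garden and Frommer criteria (length of at least 200 bp, GC content above 50%, observed/expected CpG ratio above 0.6) are used here as an assumption, and `cpg_stats` and `looks_like_cpg_island` are hypothetical helper names.

```python
def cpg_stats(seq):
    """Return (length, GC fraction, observed/expected CpG ratio) for a DNA string."""
    s = "".join(seq.split()).upper()      # tolerate spaces and lowercase input
    n = len(s)
    c, g = s.count("C"), s.count("G")
    cpg = s.count("CG")                   # observed CpG dinucleotides
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count under independence is (C * G) / N,
    # so observed/expected = CpG * N / (C * G).
    obs_exp = cpg * n / (c * g) if c and g else 0.0
    return n, gc_fraction, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply the assumed thresholds to a candidate region."""
    n, gc_fraction, obs_exp = cpg_stats(seq)
    return n >= min_len and gc_fraction > min_gc and obs_exp > min_ratio


print(cpg_stats("CGCGCG"))
```

Pasting the 219-nucleotide sequence above into `cpg_stats` gives the same three quantities for the predicted island.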


Gene prediction

with HMM on the complete cosmid sequence


Gene 1

Gene 2

Gene 3

Gene 4

Wrong CDS ?

3 HMM models: firstex, exon_n, lastex


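Summary:

[Figure: map of the cosmid showing the four predicted genes, labelled 1 to 4]

tRNA: positions 169 to 238

Predicted CpG island: positions 21954 to 22172, in the middle of CDS4. It is not a ‘classical’ CpG island (it is not in the 5’ region of a gene).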


Gene 1


Gene 1 prediction with HMMgene

One gene found


Gene 1 prediction with HMMgene

With ‘human’: 2 genes found, one on each strand (the one on the minus strand has lower scores).

The programs are ‘trained’ with sequences from specific organisms. Codon bias, for example, differs between species.


Example of codon usage tables (-> codon bias): http://www.kazusa.or.jp/codon/
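Codon bias can be made concrete by counting codons in a coding sequence. The Python sketch below returns the fraction of each codon in an in-frame CDS; it is only an illustration, `codon_usage` is a hypothetical helper, and the toy CDS is invented.

```python
from collections import Counter


def codon_usage(cds):
    """Return {codon: fraction} for an in-frame coding sequence."""
    s = "".join(cds.split()).upper()
    usable = len(s) - len(s) % 3          # ignore a trailing partial codon
    counts = Counter(s[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


# Toy CDS: ATG AAG AAG TAA
usage = codon_usage("ATGAAGAAGTAA")
print(usage)
```

Running the same function on coding sequences from two organisms and comparing the fractions for synonymous codons (for example AAG versus AAA for lysine) shows why a model trained on one organism can score genes from another poorly.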


Gene 1 prediction with Netgene2

Netgene2 gives the positions of the first and last nucleotides of the intron (donor and acceptor splice sites).

[Figure: an intron bounded by the GT donor site at its 5’ end and the AG acceptor site at its 3’ end]
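The GT...AG rule can be checked directly once intron coordinates are known. The sketch below assumes 1-based inclusive coordinates on the strand of the gene (first and last nucleotide of the intron, as Netgene2 reports them); `has_canonical_splice_sites` is a hypothetical helper and the genomic string is a toy example.

```python
def has_canonical_splice_sites(genomic, intron_start, intron_end):
    """True if the intron begins with GT (donor) and ends with AG (acceptor).

    Coordinates are 1-based and inclusive.
    """
    intron = genomic[intron_start - 1:intron_end].upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


# Toy sequence: exon AAA, intron GTCCCCAG (positions 4 to 11), exon TTT.
toy = "AAAGTCCCCAGTTT"
print(has_canonical_splice_sites(toy, 4, 11))
print(has_canonical_splice_sites(toy, 5, 11))
```

A check like this is a quick sanity test on predicted boundaries; an off-by-one error in the coordinates usually makes it fail.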


Gene 1 prediction with GeneBuilder (organism: no choice… human; option: first and last exon disabled)

Matrix: miscellaneous

One gene found


Gene 1 prediction with GenScan

No choice of organism except vertebrate, maize and Arabidopsis!

Two genes found



FGENESH

One gene found


Summary (gene prediction)

[Figure: map of the gene 1 region with its 5’ and 3’ ends marked]

Positions labelled on the map: 1003, 1083, 1305, 1406, 1452, 1661, 1914, 1997, 2000

Netgene2 splice sites (DO: donor site; AC: acceptor site; score in parentheses):

DO 1084 (1.00)
AC 1304 (0.77)
DO 1407 (0.89)
AC 1451 (0.90)
DO 1662 (1.00)
AC 1913 (1.00)

HMMgene, GeneBuilder, Netgene2 and GenScan (organism = human!)

1557 (organism = human!)

977

GeneMark: finds a second gene in 3’!

163 211

FGENESH

+ another potential gene from positions 2000 to 2900

One gene


ID FGENESH Unreviewed; 159 AA.
SQ SEQUENCE 159 AA; 17780 MW; F9A2C7DE9614425C CRC64;

MKVETCVYSG YKIHPGHGKR LVRTDGKVQI FLSGKALKGA KLRRNPRDIR WTVLYRIKNK KGTHGQEQVT RKKTKKSVQV VNRAVAGLSL DAILAKRNQT EDFRRQQREQ AAKIAKDANK

AVRAAKAAAN KEKKASQPKT QQKTAKNVKT AAPRVGGKR//

ID GENESCAN1 Unreviewed; 159 AA.
SQ SEQUENCE 159 AA; 17780 MW; F9A2C7DE9614425C CRC64;

MKVETCVYSG YKIHPGHGKR LVRTDGKVQI FLSGKALKGA KLRRNPRDIR WTVLYRIKNK KGTHGQEQVT RKKTKKSVQV VNRAVAGLSL DAILAKRNQT EDFRRQQREQ AAKIAKDANK

AVRAAKAAAN KEKKASQPKT QQKTAKNVKT AAPRVGGKR//

ID GENESCAN2 Unreviewed; 202 AA.
SQ SEQUENCE 202 AA; 23684 MW; 98A69FA21823F2F3 CRC64;

MRTLRIAQYS VLTVGFAIYM YRLIEEIPID IRNLNSDSLE GIINSDELCD VTVSNRNRGL LVRNDSLDLD ILKAKFTTFF SKRYLTRFLS EQVPFLHVID EALLVKRFVM CACFMVFCLT VIWFLVIRRM GNLIKRLSVL NQLEDAESVE WARCIREFTQ EKLAVLCFCI VPPFAQTDKL

VSDKIKLFRE HKILRIRSVQ HI//

ID GENEMARK1 Unreviewed; 184 AA.
SQ SEQUENCE 184 AA; 20255 MW; 85BB0234E6C14EA0 CRC64;

MGRCGSSGKR DGYGAKDSSS EGLSTMKVET CVYSGYKIHP GHGKRLVRTD GKVQIFLSGK ALKGAKLRRN PRDIRWTVLY RIKNKKGTHG QEQVTRKKTK KSVQVVNRAV AGLSLDAILA KRNQTEDFRR QQREQAAKIA KDANKAVRAA KAAANKEKKA SQPKTQQKTA KNVKTAAPRV

GGKR//

ID GENEMARK2 Unreviewed; 183 AA.
SQ SEQUENCE 183 AA; 21336 MW; 64F65D472A58046E CRC64;

MRTLRIAQYS VLTVGFAIYM YRLIEEIPID IRNLNSDSLE GIINSDELCD VTVSNRNRGL LVRNDSLDLD ILKAKFTTFF SKRYLTRFLS EQVPFLHVID EALLVKRFVM CACFMVFCLT VIWFLVIRRM GNLIKRLSVL NQLEDAESVE WARCIREFTQ EKLAVLCFCI VPPFAQTDNV

QHI//


For fun…

Compare the predictions of the same program (GeneMark) with different parameters (HMM trained on eukaryota or prokaryota)


Two genes found


Gene 1 prediction with GeneMark (prokaryota specific; E. coli K12)

Protein 1

Protein 2

A CDS corresponds roughly to an ‘exon’: prokaryotic genes have no spliceosomal introns!
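Because a prokaryotic CDS is one uninterrupted open reading frame, a very rough first approximation of prokaryote-style gene finding is an ORF scan. The sketch below is not how GeneMark works (GeneMark relies on statistical models of coding sequence); it only looks for ATG...stop stretches on the forward strand, and `find_orfs` and its `min_codons` parameter are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return ORFs on the forward strand as 1-based inclusive (start, end) pairs.

    An ORF runs from the first ATG in a frame to the next in-frame stop codon;
    the reported end includes the stop codon. min_codons counts the codons
    before the stop.
    """
    s = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(s) - 2, 3):
            codon = s[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start + 1, i + 3))
                start = None
    return sorted(orfs)


print(find_orfs("CCATGAAATGA"))
```

Applied to a eukaryotic gene, a scan like this tends to report each long coding stretch as a separate short protein, which is consistent with the two proteins found above with the prokaryotic model.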


Summary (prokaryota gene prediction)

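[Figure: map of the gene 1 region with its 5’ and 3’ ends marked]

Positions labelled on the map: 1003, 1083, 1305, 1406, 1452, 1661, 1914, 1997, 2000

Netgene2 splice sites (DO: donor site; AC: acceptor site; score in parentheses):

DO 1084 (1.00)
AC 1304 (0.77)
DO 1407 (0.89)
AC 1451 (0.90)
DO 1662 (1.00)
AC 1913 (1.00)

HMMgene, GeneBuilder, Netgene2

GenScan

1437 1688

GeneMark (prokaryota)

1254 1433

Protein 1

Protein 2

1557

GeneMark (eukaryota)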


Alignment between the ‘eukaryota and prokaryota’ predicted sequences


Gene prediction: similarity searches with ESTs

ESTs: expressed sequence tags (cDNAs that are sequenced quickly, in a single pass, and are therefore error-prone)


Blast 2012

Gene A Gene B

Two genes found


Blast 2010

Gene A Gene B


EST1

>gi|47590759|gb|BJ750997.1|BJ750997 BJ750997 unpublished oligo-capped cDNA library Caenorhabditis elegans cDNA clone yk1360e06 5', mRNA sequence
GGTTTAATTACCCAAGTTTGAGATTCGTCAAGCGAGGGCCTATCAGCAATGAAGGTCGAAACCTGCGTTTACTCCGGATACAAGATCCACCCAGGACACGGAAAGAGACTTGTCCGTACTGACGGAAAGGTGAGTTCAGTTTCTCTTTGAAAGGCGTTAGCATGCTGTTAGAGCTCGTAAGGTATATTGTAATTTTACGAGTGTTGAAGTATTGCAAAAGTAAAGCATAATCACCTTATGTATGTGTTGGTGCTATATCTTCTAGTTTTTAGAAGTTATACCATCGTTAAGCATGCCACGTGTTGAGTGCGACAAACTACCGTTTCATGATTTATTTATTCAAATTTCAGGTCCAAATCTTCCTCAGTGGAAAGGCACTCAAGGGAGCCAAGCTTCGCCGTAACCCACGTGACATCAGATGGACTGTCCTCTACAGAATCAAGAACAAGAAGGGAACCCACGGACAAGAGCAAGTCACCAGAAAGAAGACCAAGAAGTCCGTCCAGGTTGTTAACCGCGCCGTCGCTGGACTTTCCCTTGATGCTATCCTTGCCAAGAGAAACCAGACCGAAGACTTCCGTCGCCAACAGCGTGAACAAGCCGCTAAGATCGCCAA

EST2

>gi|47646579|gb|BJ775052.1|BJ775052 BJ775052 unpublished oligo-capped cDNA library Caenorhabditis elegans cDNA clone yk1360e06 3', mRNA sequence
ATAACGGGACCGAGAACGTTTATCGCTTTCCTCCGACACGTGGAGCAGCAGTCTTCACATTCTTGGCGGTCTTTTGCTGGGTCTTTGGCTGAGAGGCCTTCTTTTCCTTGTTGGCAGCAGCCTTGGCGGCACGGACAGCCTTGTTGGCATCCTTGGCGATCTTAGCGGCTTGTTCACGCTGTTGGCGACGGAAGTCTTCGGTCTGGTTTCTCTTGGCAAGGATAGCATCAAGGGAAAGTCCAGCGACGGCGCGGTTAACAACCTGGACGGACTTCTTGGTCTTCTTTCTGGTGACTTGCTCTTGTCCGTGGGTTCCCTTCTTGTTCTTGATTCTGTAGAGGACAGTCCATCTGATGTCACGTGGGTTACGGCGAAGCTTGGCTCCCTTGAGTGCCTTTCCACTGAGGAAGATTTGGACCTGAAATTTGAATAAATAAATCATGAAACGGTAGTTTGTCGCACTCAACACGTGGCATGCTTAACGATGGTATAACTTCTAAAAACTAGAAGATATAGCACCAACACATACATAAGGTGATTATGCTTTACTTTTGCAATACTTCAACACTCGTAAAATTACAATATACCTTACGAGCTCTAACAGCATGCTAACGCCTTTCAAAGAGAAACTGAACTCACCTTTCCGTCAGTACGGACAAGTCTCTTTCCGTGTCCTGGGTGGATCTTGTATCCGGAGTAAACGCAGGTTTCGACCTTCATTGCTGATANGCCCTCGCTTGACGAATCTCAAACTTGGGTAATTAAACCCCA

EST3

>gi|47727995|gb|BJ818152.1|BJ818152 BJ818152 unpublished oligo-capped cDNA library, stage L4 Caenorhabditis elegans cDNA clone yk1685h11 3', mRNA sequence TAACGGGACCGAGAACGTTTATCGCTTTCCTCCGACACGTGGAGCAGCAGTCTTCACATTCTTGGCGGTC TTTTGCTGGGTCTTTGGCTGAGAGGCCTTCTTTTCCTTGTTGGCAGCAGCCTTGGCGGCACGGACAGCCT TGTTGGCATCCTTGGCGATCTTAGCGGCTTGTTCACGCTGTTGGCGACGGAAGTCTTCGGTCTGGTTTCT CTTGGCAAGGATAGCATCAAGGGAAAGTCCAGCGACGGCGCGGTTAACAACCTGGACGGACTTCTTGGTC TTCTTTCTGGTGACTTGCTCTTGTCCGTGGGTTCCCTTCTTGTTCTTGATTCTGTAGAGGACAGTCCATC TGATGTCACGTGGGTTACGGCGAAGCTTGGCTCCCTTGAGTGCCTTTCCACTGAGGAAGATTTGGACCTT TCCGTCAGTACGGACAAGTCTCTTTCCGTGTCCTGGGTGGATCTTGTATCCGGAGTAAACGCAGGTTTCG ACCTTCATTGTTGATAGGCCCTCGCTTGACGAATCTCAAACTTGGGTAATTAAACCTACAAATAAAAATG AGATAAAGCATACTGCCATTCTACAACCGGAGAATAAGAAAACCGAAAACGAGAAAATTATTCTATTATG ACAGATAGAATAAGTTAAAATGGGAAGAGTGCATTTGTCACTGATTTACTTGGTGACTTGGTGGAGAGCG TGGGCAAGGTAAGCGACATTGTTCGATGAA

Gene A


Blast result with EST1: aligned regions 975-1407, 1450-1615 and 1692-1865

BUT: Blast does not take intron-exon boundaries into account when aligning a cDNA (EST) with genomic DNA, so we have to use a dedicated tool: SIM4

The third part of EST1 is of very poor quality
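The stretches left uncovered between consecutive Blast blocks are only candidate introns, which is why a splice-aware aligner is needed to place the boundaries. The sketch below computes those uncovered stretches, assuming the three ranges reported for EST1 are 1-based inclusive genomic coordinates; `gaps_between` is a hypothetical helper.

```python
def gaps_between(blocks):
    """Return the uncovered 1-based inclusive intervals between aligned blocks."""
    ordered = sorted(blocks)
    gaps = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start > prev_end + 1:
            gaps.append((prev_end + 1, next_start - 1))
    return gaps


# Ranges taken from the Blast result with EST1.
blocks = [(975, 1407), (1450, 1615), (1692, 1865)]
gaps = gaps_between(blocks)
lengths = [end - start + 1 for start, end in gaps]
print(gaps, lengths)
```

Comparing these gaps with the donor and acceptor positions predicted by Netgene2 shows how far a plain local alignment can be from the real splice junctions.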


SIM4 alignment

Example with EST1 BJ750997 (partial)

The third part of EST1 is of very poor quality and is not aligned by SIM4, so EST1 is considered partial!


EST 3 BJ818152

SIM4 alignment results

EST 1 BJ750997(partial)

EST 2 BJ775052


summary (ESTs)

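[Figure: map of gene A with its 5’ and 3’ ends marked and the three ESTs aligned to it]

Positions labelled on the map: 1003, 1083, 1305, 1406, 1452, 1615, 1661, 1914, 1997

EST1 BJ750997.1 (partial)

EST2 BJ775052.1

EST3 BJ818152.1

Alternative splicing event (intron retention) -> 2 different mRNAs

Gene A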


Translation and BLASTp

Translation (beware of the EST sequence orientation!)
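The orientation issue can be illustrated with a short Python sketch: a 5' EST can be translated as it is, whereas a 3' EST must be reverse-complemented first. The code uses the standard genetic code; the helper names `reverse_complement` and `translate` are hypothetical, and the two 15-nucleotide fragments are excerpts of EST1 and EST2 as given in this document.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (N stays N)."""
    return seq.upper().translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def translate(seq, frame=0):
    """Translate one reading frame; unknown codons (e.g. with N) give 'X'."""
    s = seq.upper()
    return "".join(CODON_TABLE.get(s[i:i + 3], "X")
                   for i in range(frame, len(s) - 2, 3))


est1_fragment = "ATGAAGGTCGAAACC"   # excerpt of the 5' EST (EST1)
est2_fragment = "GGTTTCGACCTTCAT"   # excerpt of the 3' EST (EST2)

print(translate(est1_fragment))
print(translate(reverse_complement(est2_fragment)))
```

Both calls give the same peptide start, which shows that the 3' EST reads the same region from the opposite strand. In practice all three frames of each strand should be inspected, since the frame of an EST is not known in advance.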


EST1 (BJ750997.1; sequence given above)


MIYLFKFQVQIFLSGKALKGAKLRRNPRDIRWTVLYRIKNKKGTHGQEQVTRKKTKKSVQ

VVNRAVAGLSLDAILAKRNQTEDFRRQQREQAAKIA

Blastp results


EST2 (BJ775052.1; sequence given above)


MIYLFKFQVQIFLSGKALKGAKLRRNPRDIRWTVLYRIKNKKGTHGQEQVTRKKTKKSVQ VVNRAVAGLSLDAILAKRNQTEDFRRQQREQAAKIAKDANKAVRAAKAAANKEKKASQPK

TQQKTAKNVKTAAPRVGGKR

Blastp results


EST3 (BJ818152.1; sequence given above)


EST1 is partial at the C-terminus

Gene A


EST1 is partial.

EST3 corresponds to the UniProtKB/Swiss-Prot RL24_CAEEL sequence.

Gene A


Some prediction programs give the correct protein sequence.

None of them predicted the alternative splicing event (EST2; retention of intron 1084-1304).

Gene A


summary (ESTs)

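[Figure: map of gene A with its 5’ and 3’ ends marked and two ESTs aligned to it]

Positions labelled on the map: 1003, 1083, 1305, 1406, 1452, 1661, 1914, 1997

EST BJ775052.1

EST BJ818152

Alternative splicing event (intron retention) -> 2 different mRNAs

MKVET… (1010)

MIYLF… (1284)

Gene A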


Gene 1 is on C. elegans chromosome I


BLAT results

Isoform 2

EST2

Gene B

Gene A


>NP_491399 length=159 MKVETCVYSGYKIHPGHGKRLVRTDGKVQIFLSGKALKGAKLRRNPRDIR WTVLYRIKNKKGTHGQEQVTRKKTKKSVQVVNRAVAGLSLDAILAKRNQT EDFRRQQREQAAKIAKDANKAVRAAKAAANKEKKASQPKTQQKTAKNVKT AAPRVGGKR

RefSeq sequence


InterPro scan results: the protein contains a ribosomal L24e domain


Conclusions (1)

There are 2 different protein sequences due to alternative splicing: the shorter isoform results from an intron retention and is rarely expressed (only 2 ESTs).

Gene A


Conclusions (2)

Gene prediction programs cannot predict an alternative splicing event (they can only predict the alternative splice junctions).

The protein (Gene A) is a ribosomal protein which belongs to the ribosomal protein L24e family (UniProtKB/Swiss-Prot O01868).

The alternatively spliced sequence is not yet in the protein sequence databases, because it is ‘derived’ from EST sequences, which are submitted to public DNA/RNA databases without an annotated CDS.


Non-coding region analysis


3’ end of chromosome Y (EMBL AJ271736)


Example of Alu sequence


Gene 2


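Summary diagram

[Figure: map of the gene 2 region with its 5’ and 3’ ends marked]

Positions labelled on the map: 789, 1111, 1410, 1636, 1688, 1845

Netgene2 splice sites (DO: donor; AC: acceptor; score in parentheses):

AC 1112 (0.56)
DO 1409 (0.92)
DO 1556 (0.96)
AC 1637 (0.61)

HMMgene

1557

Exon 1, Exon 2, Exon 3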


1112 1407 1637 1688


GeneBuilder prediction is not confirmed anywhere else


CDS2 (3 exons)

RefSeq NP_491393 (AF272397)

UniProtKB/TrEMBL: G5EC89

237 AA; 3 exons

MMMEYGGYFS SSAVAQQSGD VPTTAPSAVT NSFFYTPQSH NIYHQYATPY LQSGRALTTA HNTSSSSAGN STSSSSSSSN YRNTTHDSLQ AFFNTGLQYQ LYQKSQLIGS DTIQRTSSNV LNGLPRSSLV GALCSTGGAP LNPAERRKQR RIRTTFTSGQ LKELERSFCE THYPDIYTRE EIAMRIDLTE ARVQVWFQNR RAKYRKQEKI RRVKDEEEDP LKKEPGQISL EEIIDQI

A probable nuclear protein with a DNA binding domain (homeobox)


Gene 3


Numbering: ‘direct strand’


CDS3

>tr|O01864|O01864_CAEEL Hypothetical protein - Caenorhabditis elegans. METEVMKSFNNELSSLFDSKNMSKNKIQDITKAAIKAKSQYKHVVFSVEKLINKCKPDQR LNVLYVIDSIVRASKHQLKEKDTFGPRFMKQFDKFLMPLLKCGQKEKMRTVRTLNLWMSN KVFKESEIQPLREMCKASGLTIDFEEVELAVKGKQADMSIYSGVYKKKPKRSSSSSQPKS RTPTNPHPDDGLLGAGPSSALRSVPDIPNFVLSEDYFLGTISEREMLELVQKFGIDRSGV LSKDKNLLQRALQIFAGSLSQKVEEVLAENNRINGSSIQNVLTKDFEYSDDEEEKEKEPQ PEKQKNLPHAQVLLLAQSLLTQPQILAKLAEVLIPQGNPFGLPFPGEHIVPTSSAALTLG APPPNLMALQQSLPPGFPNQQLGLPNLSGLNQAQLMNVQNAQNMLQLQQRAAQLQALQGN PNAQRNLLMLGNPLLNPFALQHGVNPMLNDLQAAAAAQQQAMLNEAAQSPEKKILELSGG NSGINNSGDVERARLREKEKERESKERRRMGLPPVRIGFTIIASRTLWLKKIPTNIVEND LKQAVESCGEASRVKVIGNRACAYITMENRRSANDVVSKMREVSVAKKMVKVYWARSPGM DSDQFSDLWDSNRGVLEIPYEKLPLDLVALCEGAMLDIESLPIEKKLLYKETGETVISIP PPNIQPPVPHPPPMGFPFQHQLTQLPGQPRPAGLPPGVPPMFNLNAPPPPGIPGYPPAPP PPGVGPPPPQGIPPMGFDPNKPPPPMFQQGFNAGAPPPPFGRGAGPMSSFPPPPRGGMHH MPPPPSFRGGRGGHGGPPPPHFDRRGGGGPPFRPENGRGRLLDQSEMWNREQREMRGGGG AGRDGGREHRDYDRDRSQIDRRRQDDMGARRRSRWGDDDRRDDDRRDDRRDDRRESRRRS PRSPRSPDRRTRRSPSYEREEPPVKKTSVEEETVSSTTLDELKPSVEPTPVPAPIPAPAP

ELKAAEEPVKIVAEHHEDQTDEVPMDLE


Gene 4


Removed from gene 4: 1412-1691, 1795-5682, 5842-6048, 6865-6907, 7133-7413, 7518-7589, 7754-7799, 7912-7958, 8154-8222, 8414-8496, 8660-8709, 9043-9114, 9529-9573, 9706-9769, 9943-9996
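Removing a list of intervals from a genomic sequence, as done here for gene 4, can be sketched as follows. The function assumes 1-based inclusive, non-overlapping intervals; `remove_intervals` is a hypothetical helper and the toy sequence is invented.

```python
def remove_intervals(seq, intervals):
    """Return seq with the given 1-based inclusive intervals cut out.

    Raises ValueError if two intervals overlap, since that usually
    signals a typo in the coordinate list.
    """
    kept = []
    cursor = 0                            # 0-based index of the next kept base
    for start, end in sorted(intervals):
        if start - 1 < cursor:
            raise ValueError(f"interval {start}-{end} overlaps a previous one")
        kept.append(seq[cursor:start - 1])
        cursor = end
    kept.append(seq[cursor:])
    return "".join(kept)


# Toy example: remove positions 4-6 and 10-12.
spliced = remove_intervals("AAACCCGGGTTT", [(4, 6), (10, 12)])
print(spliced)
```

The overlap check is a deliberate design choice: with fifteen intervals typed by hand, a single mistyped coordinate is easy to miss and would otherwise silently shift the reading frame.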

Exon boundaries for gene 4 (columns: EST, HMMgene, WebGene, Netgene2)

1346 1411 (AG) (GT)
1695 1794 1691 1795
5405 5449
5679 5841 5668 5859 5683 5841 5682 5842
6049 6080 6049 6864 6049 6864 6048 6865
6908 6993 6908 7132 6908 7132 6907 7133
7187 7328 7187 7328 7186 7329
7411 7520 7414 7517 7414 7517 7413 7518
7564 7589
7959 8153 7958 8154
7589 7753 7589 7754
7800 7911 7800 7911 7799 7912
7954 8113 7959 8135
8223 8413 8223 8413 8222 8414
8497 8659 8497 8659 8496 8660
8710 9042 8710 9042 8709 9043
9115 9528 9115 9528 9114 9529
9631 9705 9574 9705 9574 9705 9573 9706
9770 9943 9770 9946 9770 9942 9943
9997 10350 9996


Protein Q9N323

>tr|Q9N323|Q9N323_CAEEL Hypothetical protein - Caenorhabditis elegans. MSTNNYQTLSQNKADRMGPGGSRRPRNSQHATASTPSASSCKEQQKDVEHEFDIIAYKTT FWRTFFFYALSFGTCGIFRLFLHWFPKRLIQFRGKRCSVENADLVLVVDNHNRYDICNVY YRNKSGTDHTVVANTDGNLAELDELRWFKYRKLQYTWIDGEWSTPSRAYSHVTPENLASS APTTGLKADDVALRRTYFGPNVMPVKLSPFYELVYKEVLSPFYIFQAISVTVWYIDDYVW YAALIIVMSLYSVIMTLRQTRSQQRRLQSMVVEHDEVQVIRENGRVLTLDSSEIVPGDVL VIPPQGCMMYCDAVLLNGTCIVNESMLTGESIPITKSAISDDGHEKIFSIDKHGKNIIFN GTKVLQTKYYKGQNVKALVIRTAYSTTKGQLIRAIMYPKPADFKFFRELMKFIGVLAIVA FFGFMYTSFILFYRGSSIGKIIIRALDLVTIVVPPALPAVMGIGIFYAQRRLRQKSIYCI SPTTINTCGAIDVVCFDKTGTLTEDGLDFYALRVVNDAKIGDNIVQIAANDSCQNVVRAI ATCHTLSKINNELHGDPLDVIMFEQTGYSLEEDDSESHESIESIQPILIRPPKDSSLPDC QIVKQFTFSSGLQRQSVIVTEEDSMKAYCKGSPEMIMSLCRPETVPENFHDIVEEYSQHG YRLIAVAEKELVVGSEVQKTPRQSIECDLTLIGLVALENRLKPVTTEVIQKLNEANIRSV MVTGDNLLTALSVARECGIIVPNKSAYLIEHENGVVDRRGRTVLTIREKEDHHTERQPKI VDLTKMTNKDCQFAISGSTFSVVTHEYPDLLDQLVLVCNVFARMAPEQKQLLVEHLQDVG QTVAMCGDGANDCAALKAAHAGISLSEAEASIAAPFTSKVADIRCVITLISEGRAALVTS YSAFLCMAGYSLTQFISILLLYWIATSYSQMQFLFIDIAIVTNLAFLSSKTRAHKELAST PPPTSILSTASMVSLFGQLAIGGMAQVAVFCLITMQSWFIPFMPTHHDNDEDRKSLQGTA IFYVSLFHYIVLYFVFAAGPPYRASIASNKAFLISMIGVTVTCIAIVVFYVTPIQYFLGC LQMPQEFRFIILAVATVTAVISIIYDRCVDWISERLREKIRQRRKGA


Prediction of mitochondrial genes (human)


NC_012920.1

Mitochondrial genome NC_012920.1 annotation

tRNA scan prediction

tRNA scan lists:
1- all the tRNAs on the current strand
2- all the tRNAs on the complementary strand

This tRNA is found at the end of the list


Conclusion

• Good tRNA prediction
• If you try: very bad protein-coding gene prediction…
– The mitochondrial genome does not have the same sequence content (codon bias, signals) as the nuclear genome.
– You might try a ‘prokaryota’-like gene model, but the results are not perfect!
