prokaryotic gene finding

29
Center for Biological Sequence Analysis Prokaryotic gene finding Marie Skovgaard Ph.D. student [email protected]

Post on 02-Jan-2016

Prokarya

>AE006641GTATACTCTTCTTCCCTATACATTGTCGCAGCAAGCTTAGTTTCTTTAGCCTCTCTGCTTTCATTATTACTTATAATCTTAATAGCAAGGAGACATATGATAGAGTATTTCTATATGATTCCTTCGTTCGTTTATATGAACTTTATTGTCGCACTAAACTTCACTGCAATATTTTTAGAGTTAATAAGAGCACCTAGAGTGTGGGTAAAAACTGAAAGAAGTGCCAAGGTTACGGGGGAGGTCATGGGATGATAACTGAATTTTTACTTAAAAAGAAATTAGAAGAACATTTAAGCCATGTAAAGGAAGAGAATACGATATATGTAACAGATTTAGTAAGATGCCCCAGAAGAGTAAGATATGAGAGTGAATACAAGGAGCTTGCAATCTCTCAGGTTTACGCGCCTTCAGCTATTTTAGGGGACATATTGCATCTCGGTCTTGAAAGCGTATTAAAAGGGAACTTTAATGCAGAAACTGAAGTTGAAACTCTGAGAGAAATTAACGTCGGAGGTAAAGTTTATAAAATTAAAGGAAGAGCCGATGCAATAATTAGAAATGACAACGGGAAGAGTATTGTAATTGAGATAAAAACTTCTAGAAGTGATAAAGGATTACCTCTAATTCATCATAAAATGCAGCTACAGATATATTTATGGTTATTTAGTGCAGAAAAAGGTATACTAGTTTACATAACTCCAGATAGGATAGCTGAGTATGAAATAAACGAACCTTTAGATGAAGCAACAATAGTAAGACTTGCAGAGGATACAATAATGTTACAAAACTCACCTAGATTCAACTGGGAATGTAAATATTGCATATTTTCCGTCATTTGCCCAGCTAAACTAACCTAAAATTAAAATCTCTCATCGATATAATTAAATTGTGCACACTAGACCAGTAGTTGCCACAATAGCTGGGAGTGACAGTGGAGGAGGTGCTGGATTACAGGCTGATCTAAAGACGTTTAGCGCATTAGGAGTTTTTGGTACAACAATAATAACCGGTTTAACAGCACAGAATACAAGAACAGTTACAAAAGTATTAGAGATACCATTAGATTTCATTGAAGCTCAGTTTGATGCGGTTTGCCTAGATTTACATCCAACTCACGCCAAAACTGGAATGTTAGCTTCTGGTAAAGTGGTAGAACTTGTACTGAGAAAAATTAGAGAGTATAACATAAAACTAGTTTTAGATCCAGTGATGGTTGCGAAATCTGGATCATTATTGGTAACAGAGGATATCTCGGAGCAAATAAAAAAGGCGATGAAGGAGGCCATAATATCTACTCCAAACAGATATGAAGCTGAGATAATAAATAAGACAAAGATTAATAGTCAAGATGATGTTATAAAAGCGGCAAGGGAAATTTATTCTAAGTATGGGAATGTTGTAGTTAAAGGATTTAATGGAGTAGATTACGCCATAATTGACGGAGAAGAAATAGAGTTAAAAGGTGATTACATCAGTACTAAAAATACACATGGTAGTGGAGACGTATTTTCTGCCTCCATAACTGCATATCTTGCCTTGGGATACAAACTTAAAGATGCATTAATAAGAGCTAAAAAATTCGCTACAATGACAGTCAAATACGGTTTGGACTTAGGAGGAGGATATGGACCAGTAGATCCCTTTGCCCCTATAGAGTCCATAGTGAAGAGAGAAGAAGGAAGAAATCAGCTAGAAAACTTACTTTGGTACTTAGAGTCTAATCTTAACGTTATACTTAAACTAATTAACG

Can you spot the gene?

/(ATG|TTG|GTG)((...)*?)(TGA|TAG|TAA)/

Identifying open reading frames

/(ATG|TTG|GTG)((...)*?)(TGA|TAG|TAA)/
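The pattern above can be tried directly in Python. This is a minimal sketch rather than the method of any particular gene finder: the function name `find_orfs` and the toy sequence are illustrative assumptions, and only the forward strand is scanned.

```python
import re

# Start codon, a lazy run of whole codons, then the first in-frame stop codon.
ORF_PATTERN = re.compile(r"(ATG|TTG|GTG)((?:...)*?)(TGA|TAG|TAA)")

def find_orfs(sequence):
    """Return (start, end, orf) for each non-overlapping match on the forward strand."""
    return [(m.start(), m.end(), m.group(0)) for m in ORF_PATTERN.finditer(sequence)]

# Hypothetical toy sequence, not taken from AE006641.
toy = "CCATGAAATTTTAGGG"
orfs = find_orfs(toy)
```

Because the codon run is matched three letters at a time, the out-of-frame TGA inside ATGAAA is skipped and the match ends at the in-frame TAG. The pattern reports every start-to-stop stretch regardless of length, and `finditer` returns only non-overlapping matches, so nested candidates and the other reading frames would need separate scans.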

A. pernix (43% AT)

Why care about over-annotated genes?

Genome comparison:
• Fraction of known proteins
• Average gene length
• Amino acid composition

The quality of our databases

To gain biological knowledge

Regular expression

Regular expression: /[AT][CG][AC][ACGT]*A[TG][CG]/

The regular expression finds all possible sequences, but it does not distinguish between the consensus sequence and a highly unlikely sequence:

ACAC--ATC or TGCT--AGG

Weight matrices can be used to score a sequence, but they do not deal with insertions and deletions.

ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATG
ACCG--ATC
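A short check makes the limitation concrete. The sketch below applies the slide's regular expression to the two example sequences with their gap characters removed; the variable names are illustrative.

```python
import re

MOTIF = re.compile(r"[AT][CG][AC][ACGT]*A[TG][CG]")

consensus = "ACACATC"  # ACAC--ATC with the gap characters removed
unlikely = "TGCTAGG"   # TGCT--AGG with the gap characters removed

# fullmatch gives a yes/no answer only; there is no score attached.
matches = {seq: MOTIF.fullmatch(seq) is not None for seq in (consensus, unlikely)}
```

Both entries of `matches` are `True`, so the pattern alone cannot rank the consensus sequence above the unlikely one.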

Markov model

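Emission probabilities (a dash means the letter is not emitted):

State     A     C     G     T
1         0.8   -     -     0.2
2         -     0.8   0.2   -
3         0.8   0.2   -     -
Insert    0.2   0.4   0.2   0.2
4         1.0   -     -     -
5         -     -     0.2   0.8
6         -     0.8   0.2   -

Transition probabilities:

1 → 2: 1.0
2 → 3: 1.0
3 → 4: 0.4
3 → Insert: 0.6
Insert → Insert: 0.4
Insert → 4: 0.6
4 → 5: 1.0
5 → 6: 1.0

ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATG
ACCG--ATC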
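Unlike the regular expression, the Markov model assigns each sequence a probability. The sketch below multiplies emission and transition probabilities along the single possible path. It assumes the state layout read from the figure: three main states, an optional insert state (entered with 0.6, self-loop 0.4, left with 0.6, bypassed with 0.4), then three more main states. The names `MATCH`, `INSERT` and `path_probability` are illustrative.

```python
# Emission probabilities of the six main states, in order.
MATCH = [
    {"A": 0.8, "T": 0.2},
    {"C": 0.8, "G": 0.2},
    {"A": 0.8, "C": 0.2},
    {"A": 1.0},
    {"G": 0.2, "T": 0.8},
    {"C": 0.8, "G": 0.2},
]
INSERT = {"A": 0.2, "C": 0.4, "G": 0.2, "T": 0.2}

def path_probability(aligned):
    """Probability of a sequence: first three letters in states 1-3,
    last three in states 4-6, anything in between in the insert state."""
    seq = aligned.replace("-", "")
    if len(seq) < 6:
        raise ValueError("need at least six letters")
    head, middle, tail = seq[:3], seq[3:-3], seq[-3:]
    p = 1.0
    for letter, emission in zip(head, MATCH[:3]):
        p *= emission.get(letter, 0.0)   # main-path transitions are 1.0
    if middle:
        p *= 0.6                         # state 3 -> insert
        p *= 0.4 ** (len(middle) - 1)    # insert self-loops
        for letter in middle:
            p *= INSERT.get(letter, 0.0)
        p *= 0.6                         # insert -> state 4
    else:
        p *= 0.4                         # state 3 -> state 4 directly
    for letter, emission in zip(tail, MATCH[3:]):
        p *= emission.get(letter, 0.0)
    return p

p_consensus = path_probability("ACAC--ATC")
p_unlikely = path_probability("TGCT--AGG")
```

Under these assumptions the consensus sequence scores roughly two thousand times higher than the unlikely one, which is exactly the distinction the regular expression could not make.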

Profile HMM

Profile HMMs have a predefined architecture, and their parameters are estimated from multiple sequence alignments.

Profile HMMs are not useful for gene finding, since all the genes in an organism cannot be aligned in a meaningful way.

[Figure: profile HMM architecture with a Begin state and an End state]

Markov Model for gene finding

Define a simple architecture:

/(ATG|TTG|GTG)((...)*?)(TGA|TAG|TAA)/

S1: ATG, GTG, TTG
S2: A, T, G, C
S3: A, T, G, C
S4: A, T, G, C
S5: TAG, TAA, TGA

Markov models

Knowledge of the structure of genes is used to define the architecture of the model.

Sequences (x) from known genes are used to estimate the parameters of the model – training of the model.

Training is done by counting the number of times a nucleotide occurs in a given state and dividing this count by the total number of observations in that state (for the start and stop states, the number of training sequences), which gives the frequencies.
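The counting procedure can be sketched in a few lines. This is an illustration under stated assumptions, not the training code of an actual gene finder: the five state names S1 to S5 follow the simple architecture (start codon, three nucleotide states, stop codon), the three nucleotide states are assumed to be visited cyclically, and the four training genes are made up.

```python
from collections import Counter

def normalise(counts):
    """Turn raw counts into frequencies; empty counts give an empty table."""
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()} if total else {}

def train(genes):
    """Estimate emission frequencies by counting over known genes."""
    start, stop = Counter(), Counter()
    inner = [Counter(), Counter(), Counter()]   # assumed: S2, S3, S4 in rotation
    for gene in genes:
        start[gene[:3]] += 1
        stop[gene[-3:]] += 1
        for i, nucleotide in enumerate(gene[3:-3]):
            inner[i % 3][nucleotide] += 1
    return {
        "S1": normalise(start),
        "S2": normalise(inner[0]),
        "S3": normalise(inner[1]),
        "S4": normalise(inner[2]),
        "S5": normalise(stop),
    }

# Hypothetical training set, far too small for real use.
model = train(["ATGAAATAG", "ATGCCCTAA", "GTGAAATAG", "ATGACGTGA"])
```

With these four toy genes, `model["S1"]` gives ATG a frequency of 0.75 and GTG 0.25. A realistic training set would contain hundreds of trusted genes from the organism being annotated.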

Training

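Sequence: x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 ... xn

States: S1, S2, S3, S4, S5

[Figure: the positions of the training sequence x laid out against the states of the model (S1: ATG, GTG, TTG; S2 to S4: A, T, G, C; S5: TAG, TAA, TGA)]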

Model after training

State   Emission probabilities
S1      ATG: 0.77   TTG: 0.11   GTG: 0.12   CTG: 0.00
S2      A: 0.22   T: 0.24   G: 0.27   C: 0.27
S3      A: 0.25   T: 0.23   G: 0.27   C: 0.25
S4      A: 0.26   T: 0.24   G: 0.25   C: 0.25
S5      TAG: 0.6   TAA: 0.3   TGA: 0.1

One transition in the figure is labelled 0.98.

The trained model can be used to search for genes in DNA sequences.

Searching with the HMM

Sequence: ATG A T T T C G C G C G A T ... T A G

Rows are states, columns are sequence positions:

State   ATG    A             T                  T                       ...   TAG
S1      0.77   0.00          0.00               0.00
S2      0.00   (0.22*0.77)   0.00               0.00
S3      0.00   0.00          (0.23*0.22*0.77)   0.00
S4      0.00   0.00          0.00               (0.24*0.23*0.22*0.77)
S5      0.00   0.00          0.00                                             = P(x|M)
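The running products in the table can be reproduced with a small function. This is a sketch under two assumptions that the slides do not state explicitly: all transition probabilities are taken as 1 (the products on the slide contain emission terms only), and the three nucleotide tables are assigned to S2, S3 and S4 so that they agree with the factors 0.22, 0.23 and 0.24 used for A, T, T. The names `START`, `INNER`, `STOP` and `gene_probability` are illustrative.

```python
# Emission probabilities taken from the "Model after training" slide.
START = {"ATG": 0.77, "TTG": 0.11, "GTG": 0.12, "CTG": 0.00}   # S1
INNER = [
    {"A": 0.22, "T": 0.24, "G": 0.27, "C": 0.27},               # S2
    {"A": 0.25, "T": 0.23, "G": 0.27, "C": 0.25},               # S3
    {"A": 0.26, "T": 0.24, "G": 0.25, "C": 0.25},               # S4
]
STOP = {"TAG": 0.6, "TAA": 0.3, "TGA": 0.1}                     # S5

def gene_probability(seq):
    """P(x|M) for a candidate gene, ignoring transition probabilities."""
    if len(seq) < 6 or len(seq) % 3 != 0:
        raise ValueError("expected a start codon, whole codons and a stop codon")
    p = START.get(seq[:3], 0.0)
    for i, nucleotide in enumerate(seq[3:-3]):
        p *= INNER[i % 3].get(nucleotide, 0.0)   # cycle through S2, S3, S4
    return p * STOP.get(seq[-3:], 0.0)

p = gene_probability("ATGATTTAG")   # 0.77 * 0.22 * 0.23 * 0.24 * 0.6
```

A candidate that starts with CTG gets probability zero, because that start codon was never seen in training.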

Log-Odds score

The probability of a sequence becomes vanishingly small as the sequence x gets longer.

This is solved by defining a background (NULL) model, for example a uniform random distribution: A=T=C=G=0.25

From this the log-odds score can be calculated: log(P(x|M)/P(x|NULL))

A high log-odds score corresponds to a sequence that looks more like the gene model than the background model.
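In practice the ratio is accumulated letter by letter in log space, which avoids multiplying many small numbers. The sketch below assumes base-2 logarithms, a uniform NULL model, and the three nucleotide tables from the trained model visited cyclically. It uses log(P(x|M)/P(x|NULL)) without a leading minus sign, so that a higher score means a more gene-like sequence. The names `CODING` and `log_odds` are illustrative.

```python
import math

# Nucleotide emission tables from the trained model (assumed order S2, S3, S4).
CODING = [
    {"A": 0.22, "T": 0.24, "G": 0.27, "C": 0.27},
    {"A": 0.25, "T": 0.23, "G": 0.27, "C": 0.25},
    {"A": 0.26, "T": 0.24, "G": 0.25, "C": 0.25},
]
NULL = 0.25   # uniform background: A = T = C = G = 0.25

def log_odds(body):
    """Sum of per-letter log2 ratios between the coding model and the NULL model."""
    score = 0.0
    for i, nucleotide in enumerate(body):
        score += math.log2(CODING[i % 3][nucleotide] / NULL)
    return score

gene_like = log_odds("GGA")         # every letter is favoured by the model
background_like = log_odds("ATT")   # every letter is disfavoured by the model
```

Here `gene_like` is positive and `background_like` is negative. With emission tables this close to uniform the per-letter contributions are tiny, which hints at why the simple model discriminates poorly.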

Is the model too simple?

[Figure: the five-state model again. S1: ATG, GTG, TTG; S2 to S4: A, T, G, C; S5: TAG, TAA, TGA]

Codon usage

Synonymous codons encode the same amino acid. If codon choice were random, synonymous codons would be expected to occur with equal frequencies. In real genomes, synonymous codons occur with different frequencies.

Different species have consistent and characteristic codon biases. Laterally transferred genes and genes from plasmids and phages often have atypical codon usage.

Variations in codon usage within an organism can be modelled in different coding models in the HMM.

1st position   3rd position   2nd position: U   2nd position: C   2nd position: A   2nd position: G
U              U              30,407 Phe        11,523 Ser        22,048 Tyr        7,062 Cys
U              C              22,581 Phe        11,766 Ser        16,669 Tyr        8,846 Cys
U              A              18,943 Leu        9,793 Ser         2,706 Stop        1,260 Stop
U              G              18,629 Leu        12,195 Ser        326 Stop          20,756 Trp
C              U              15,018 Leu        9,569 Pro         17,631 His        28,458 Arg
C              C              15,104 Leu        7,491 Pro         13,272 His        29,968 Arg
C              A              5,316 Leu         11,496 Pro        20,912 Gln        4,860 Arg
C              G              71,710 Leu        31,614 Pro        39,285 Gln        7,404 Arg
A              U              41,375 Ile        12,223 Thr        24,189 Asn        11,982 Ser
A              C              34,261 Ile        31,889 Thr        29,529 Asn        21,907 Ser
A              A              5,967 Ile         9,683 Thr         45,812 Lys        2,899 Arg
A              G              37,994 Met        19,682 Thr        14,076 Lys        1,694 Arg
G              U              24,910 Val        20,808 Ala        43,817 Asp        33,731 Gly
G              C              20,800 Val        34,770 Ala        25,996 Asp        40,396 Gly
G              A              14,850 Val        27,468 Ala        53,780 Glu        10,902 Gly
G              G              35,979 Val        45,862 Ala        24,312 Glu        15,118 Gly

Fields: [number] [amino acid]
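The bias is easy to quantify from the counts. The sketch below takes the six leucine counts from the table and compares their relative usage with the uniform expectation of 1/6. The assignment of counts to codons assumes the standard layout of the table, with the third position running U, C, A, G within each cell. The names `LEU_COUNTS` and `relative_usage` are illustrative.

```python
# Leucine codon counts read from the table (codon labels assume the
# standard U, C, A, G ordering of the third position).
LEU_COUNTS = {
    "UUA": 18943, "UUG": 18629,
    "CUU": 15018, "CUC": 15104, "CUA": 5316, "CUG": 71710,
}

def relative_usage(counts):
    """Fraction of each synonymous codon among all codons for one amino acid."""
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

leu_usage = relative_usage(LEU_COUNTS)
uniform = 1 / len(LEU_COUNTS)   # expected fraction if codon choice were random
preferred = max(leu_usage, key=leu_usage.get)
```

CUG accounts for nearly half of all leucine codons in this table, while CUA accounts for under 4 percent, far from the uniform expectation of about 17 percent each.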

Is the model too simple?

S1: ATG, GTG, TTG

S2: any of the 64 codons:

AAA ATA AGA ACA TAA TTA TGA TCA
AAT ATT AGT ACT TAT TTT TGT TCT
AAG ATG AGG ACG TAG TTG TGG TCG
AAC ATC AGC ACC TAC TTC TGC TCC
GAA GTA GGA GCA CAA CTA CGA CCA
GAT GTT GGT GCT CAT CTT CGT CCT
GAG GTG GGG GCG CAG CTG CGG CCG
GAC GTC GGC GCC CAC CTC CGC CCC

S3: TAG, TAA, TGA

HMM for gene finding

S1: ATG, GTG, TTG
S2: any of the 64 codons (AAA, ATA, AGA, ... CCC)
S3: any of the 64 codons (AAA, ATA, AGA, ... CCC)
S4: TAG, TAA, TGA

Multiple coding models

S (start): ATG, GTG, TTG
E (end): TAG, TAA, TGA

[Figure: twelve codon states between S and E, each emitting any of the 64 codons (AAA, ATA, AGA, ... CCC)]

Order of the model

In a zero order Markov model, each state has a probability for each letter, and these probabilities are independent of the preceding sequence. The NULL model is a zero order Markov model (A=T=G=C=0.25).

In a first order Markov model, the probability of a letter depends on the previous letter (di-nucleotide distributions).

In a second order model, it depends on the two previous letters (three letters in total, corresponding to a codon).
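The difference between orders comes down to how much context is used when counting. The sketch below estimates a Markov model of any order by tallying which letter follows each context of the given length; order 0 uses an empty context. The function name `train_markov` and the toy sequence are illustrative assumptions.

```python
from collections import Counter, defaultdict

def train_markov(sequences, order):
    """Estimate P(letter | previous `order` letters) by counting."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(order, len(seq)):
            context = seq[i - order:i]   # empty string when order == 0
            counts[context][seq[i]] += 1
    return {
        context: {letter: n / sum(tally.values()) for letter, n in tally.items()}
        for context, tally in counts.items()
    }

toy = ["ATGATGATA"]                # hypothetical training sequence
zero_order = train_markov(toy, 0)  # one table, keyed by the empty context
second_order = train_markov(toy, 2)
```

In the toy data the context AT is followed by G twice and by A once, so `second_order["AT"]["G"]` is 2/3, while the zero order model only knows that A makes up 4 of the 9 letters.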

Order of the coding model

Inter-codon dependencies are correlations between amino acids typically found in proteins. They reflect typical features of proteins and can be used to improve the performance of the gene finder.

The use of higher order coding models in gene finding is a way to capture these inter-codon dependencies.

Higher order models require more training data and more computation time when searching.
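The cost of a higher order can be made concrete by counting probabilities. For a nucleotide alphabet, a model of order k needs one probability for each combination of a k-letter context and a following letter, which is 4^(k+1) values per state. The helper below is a small illustration; the name `parameter_count` is an assumption.

```python
def parameter_count(order, alphabet_size=4):
    """Number of conditional probabilities in one state of an order-k model."""
    return alphabet_size ** (order + 1)

# Orders 0, 1, 2 and 5 for the four-letter DNA alphabet.
sizes = {order: parameter_count(order) for order in (0, 1, 2, 5)}
```

Each extra order multiplies the table size by four, so the amount of training sequence needed to observe every context often enough grows at the same rate.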

The Shine-Dalgarno sequence

The ribosome binds to the messenger RNA through base pairing between the mRNA and the 30S ribosomal subunit. The binding site is the Shine-Dalgarno sequence (SD). The SD is a purine-rich sequence (consensus sequence: AGGAG) at the 5' end of most prokaryotic mRNAs. The SD is found 5-10 bases upstream of the start codon.
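A start codon with a plausible SD upstream is stronger evidence for a gene than a start codon alone. The sketch below looks for the exact consensus AGGAG followed by a spacer of 5 to 10 bases and then a start codon. It is a deliberate simplification: real SD sites often deviate from the consensus, the spacer is measured here from the end of the motif (an assumption), and the name `find_sd_starts` and both test sequences are made up.

```python
import re

# Consensus SD, a lazy spacer of 5-10 bases, then a start codon.
SD_PATTERN = re.compile(r"AGGAG([ACGT]{5,10}?)(ATG|GTG|TTG)")

def find_sd_starts(sequence):
    """Return (sd_position, start_codon_position) for each SD-supported start."""
    return [(m.start(), m.start(2)) for m in SD_PATTERN.finditer(sequence)]

# Hypothetical sequences: a 7-base spacer, and a spacer that is too short.
supported = find_sd_starts("TTAGGAGGTAACATATGAAA")
unsupported = find_sd_starts("AGGAGCCATGAAA")
```

A gene finder built on an HMM would more likely model the SD with its own states and emission probabilities, so that imperfect matches still contribute to the score instead of being rejected outright.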

EasyGene

R. prowazekii

GeneMark.hmm: http://opal.biology.gatech.edu/GeneMark/gmhmm2_prok.cgi

Lukashin A. and Borodovsky M., "GeneMark.hmm: new solutions for gene finding", NAR, 1998, Vol. 26, No. 4, pp. 1107-1115.

EasyGene: http://cbs.dtu.dk/services/EasyGene

Schou Larsen T. and Krogh A., "EasyGene – A prokaryotic gene finder that ranks ORFs by statistical significance", BMC Bioinformatics, 2003, 4:21.