Prokaryotic gene structure

Post on 19-Dec-2015


Page 1: Estructura de gene procariota Promoter CDS Terminator transcription Genomic DNA mRNA protein UTR translation

Prokaryotic gene structure

[Figure: genomic DNA with promoter, CDS and terminator; transcription yields an mRNA with a UTR on each side of the CDS; translation yields the protein.]

Page 2:

Operons:

Page 3:

Prokaryotic Gene Organisation

[Figure: DNA with a promoter bound by RNA polymerase and by a repressor or activator; the mRNA runs 5' to 3' from the +1 startpoint, with leader, spacer and trailer regions labelled; the ribosome binds the mRNA.]

Transcription: two consensus sequences and the startpoint (+1)

• -10: TATAAT (T80 A95 t45 A60 a50 T96)

• -35: TTGACA (T82 T84 G78 A65 C54 a45)

Translation: ribosomal binding site (RBS), the Shine-Dalgarno sequence AGGAGG

Page 4:

Promoters and regulators in prokaryotes

• The promoter determines:
1. Which strand will serve as a template.
2. The transcription starting point.
3. The strength of polymerase binding.

• The RNA polymerase subunit responsible for promoter recognition is called the sigma factor
– Different variants (7 for E. coli)
– Consensus binding sequences (Table 6.2 in textbook)

• Operons allow co-transcription

• Regulators affect the binding of RNA polymerase to DNA (positive and negative)

Page 5:

Example of a prokaryotic promoter

• Pribnow box located at -10 (6-7 bp)

• Promoter sequence located at -35 (6 bp)

Page 6:

Consensus sequences

• Promoter sequences can vary tremendously.

• RNA polymerase recognizes hundreds of different promoters.

Page 7:

Terminators

• The terminator region pauses the polymerase and causes it to dissociate.

Page 8:

Production of a mature RNA in eukaryotes

• The final mRNA may represent less than 5% of the transcribed DNA sequence.

Page 9:

Simplified model of a human gene

[Figure: 5' regulatory region with the promoter, followed by the transcription unit (exon 1, intron 1, exon 2, intron 2, exon 3, intron 3, exon 4, ..., exon n) and a 3' regulatory region; an untranslated sequence is marked at each end of the transcription unit.]

After post-transcriptional processing of the primary RNA transcript, the mRNA sequence corresponds to the exon sequences, including the untranslated regions (UTRs); the introns are removed.

Page 10:

Eukaryotic genes normally contain introns

Page 11:

Types of genes in eukaryotes

Protein-encoding genes
• Transcription
• RNA polymerase II dependent promoters
• Type II splicing
• Polyadenylation (exception: histone mRNAs)
• Translation

RNA-coding genes
• Transcription
• RNA polymerase I and III dependent promoters
• Type I and III splicing
• No polyadenylation
• No translation

Page 12:

Structure of a eukaryotic gene

• TATA box located at -25
– TATA(A/T)A(A/T)
– Recognized by the TATA-binding protein

• Initiator sequence at +1
– YYCARR; Y is C/T, R is G/A
– +1 is usually the A

• Transcription factors bind to promoters
– Position-specific scoring matrix (PSSM)

• Distant regions may act as enhancers or silencers (even more than 50 kb away)
– More complex mechanism than in prokaryotes

Page 13:

Transcription can be modified by trans-acting factors that bind enhancers (activators) and silencers

Page 14:
Page 15:

Alternative splicing can produce different proteins with different functions

[Figure: two splice variants of one protein; one contains domains that adhere to cell surfaces, the other lacks them.]

Page 16:

Eukaryotic Gene Organisation

Transcription:

• Core promoter: loosely conserved initiator region (Inr) around the TSS; TATA box at ~ -25

• Proximal promoter: CAAT box (CCAAT) at ~ -75; GC box at ~ -170

• Enhancer/silencer: upstream or downstream of the promoter

[Figure: promoter layout showing the proximal promoter (GC box, CAAT box) and the core promoter (TATA box, Inr at the TSS).]

Translation:

• 5' Kozak sequence: GCCACCATG

• 3' polyadenylation site: AATAAA

Page 17:

Eukaryote gene structure vs. prokaryote gene structure

• No operons

• Capping at the 5' end and polyadenylation at the 3' end
– Transport of mRNA out of the nucleus
– Affects stability and efficiency of translation

• Introns

• Alternative splicing

Page 18:

Summary

• Prokaryotic genes

[Figure: a promoter followed by several consecutive genes, with start and stop marked, and a terminator.]

• Eukaryotic genes

[Figure: a promoter followed by exons separated by introns, with start and stop marked and a donor and an acceptor site bounding each intron.]

Page 19:

Gene prediction: Prokaryotes vs. Eukaryotes

Prokaryotes

• Conserved promoter region (-10, -35; fixed spacing)

• Contiguous open reading frames (ORFs)

• Polycistronic mRNAs

• Short intergenic sequences

Good method: detecting large ORFs

• Complications:
– Sequencing errors
– Very small genes will be missed
– Overlapping genes on both strands
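
The "detecting large ORFs" approach can be sketched in a few lines of Python. This is a minimal illustration, not a production gene finder: the function names, the `min_codons` threshold and the restriction to ATG as the only start codon are assumptions made for the example. Coordinates are reported on the strand that was scanned.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(dna, min_codons=100):
    """Scan all six reading frames for ATG ... stop stretches.

    Returns tuples (strand, start, end, n_codons); start and end are
    0-based, end-exclusive, and refer to the scanned strand. n_codons
    counts the start codon but not the stop codon. The threshold
    min_codons is an illustrative assumption.
    """
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    n_codons = (i - start) // 3
                    if n_codons >= min_codons:
                        orfs.append((strand, start, i + 3, n_codons))
                    start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
    print(find_orfs(toy, min_codons=3))
```

Raising `min_codons` reduces false positives but is exactly why very small genes are missed, and a single frameshift from a sequencing error splits or truncates an ORF.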

Page 20:

Promoter and Gene prediction: Prokaryotes vs. Eukaryotes

• Promoter elements
– Core promoter: initiator region (Inr), TATA box, downstream promoter element (DPE)
– Proximal promoter: transcription factor ("TF") binding sites (CAAT box, GC box, SP-1 sites, GAGA boxes)
– Enhancer/silencer sites (less useful)

• Coding sequence
– Signal sensors: start and stop signals (Kozak sequence, stop codons), polyadenylation signals, splicing signals (3' and 5' splice sites, splice junction, branchpoint)
– Content sensors: base composition, codon usage, hexamer usage

Page 21:

The challenge

• New data are being collected faster than they can be analysed.

• Whole-genome sequences for more than 800 organisms (bacteria, archaea and eukaryota, as well as many viruses and organelles) are either complete or being determined.

• Across all sequenced species, nearly half of the potential genes cannot be assigned a specific role.

Page 22:

Gene prediction programs should be able to identify and annotate all genes automatically

Page 23:

Three Basic Strategies for Promoter and Gene Prediction

• Homology searching

• Analysis of sequence signals

• Statistical analysis

Page 24:

Why homology?

Evolutionary relationships

[Figure: tree in which one ancestor gives rise to species 1, species 2 and species 3.]

Paralogues: homologous proteins that perform different but related functions within one organism.

Orthologues: homologous proteins that perform the same function in different species.

Page 25:

Homology Searching

• Search sequence databases such as EMBL or Swissprot with programs such as BLAST or TFASTA.

• Orthologs, homologs or paralogs may already have been described. Sequence identity may be low; several approaches should be tried.

• As more sequence data are collected, this initial step becomes more important.

Low coverage, high accuracy

Page 26:

Three Basic Strategies for Gene Prediction

• Homology searching

• Analysis of sequence signals

• Statistical analysis

Page 27:

Which signals can be used in bioinformatics for gene prediction?

Page 28:

What distinguishes genes from other genomic sequences?

Genomic sequences tend towards randomness; genes are non-random.

Page 29:

Base composition

Translated DNA sequences are restricted in the choice of nucleotides at the first, second and (to a lesser extent) third position of the codons.

The occurrence of a given base at the first, second and third positions of the potential codons will therefore not be random.

123123123123123123123123123123123123123123123123123123
ATGATAGCTATACGGATCCGTAGCTAGATCAGTAGCGTGACTGCTGTCGTCATT

A(1,4,7...) = 10 of 18 (random sequence: expected 25%)
A(2,5,8...) = 1 of 18 (random sequence: expected 25%)

Confidence levels can be calculated because large sets of coding and non-coding sequences have been analyzed.
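
The position-specific counts in the example can be reproduced with a short Python sketch. The function name is an illustrative choice, and the sequence is assumed to start in frame (position 1 is the first base of a codon).

```python
from collections import Counter


def base_counts_by_codon_position(seq):
    """Count each base separately at codon positions 1, 2 and 3.

    Returns a list of three Counters; index 0 holds the counts for
    positions 1, 4, 7, ..., index 1 for 2, 5, 8, ..., and so on.
    """
    counts = [Counter(), Counter(), Counter()]
    for i, base in enumerate(seq):
        counts[i % 3][base] += 1
    return counts


if __name__ == "__main__":
    seq = "ATGATAGCTATACGGATCCGTAGCTAGATCAGTAGCGTGACTGCTGTCGTCATT"
    counts = base_counts_by_codon_position(seq)
    n_codons = len(seq) // 3
    for pos, counter in enumerate(counts, start=1):
        print(f"A at codon position {pos}: {counter['A']} of {n_codons}")
```

A strong departure from the 25% expected for a random sequence, as for A at the first two codon positions here, is the kind of signal that content sensors exploit.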

Page 30:

Base composition bias

Frequency of the four different nucleotides at the different codon positions in human coding regions.

Page 31:

Our model gene

Growth Factor, Mouse

• Weakly expressed, tissue specific
• GC-rich (57%; CDS 66%)
• 'TATA' promoter (1011-1017)
• 2 exons

Not an easily predictable gene!

[Figure: gene map with positions 1011 (tata), 1066, 1345, 2427 and 3058 marked.]

Bottner M, Laaff M, Schechinger B, Rappold G, Unsicker K, Suter-Crazzolara C. Gene. 1999 (237):105-11.

Page 32:

Testcode

'Period three constraint' [J. Fickett, Nucl. Acids Res. 10(17); 5303-5318 (1982)]. The top and bottom regions predict coding and non-coding regions to a 95% confidence level. Start and stop codons (dashes and diamonds) are indicated.

[Figure: Testcode plot with a "coding" region and a "non-coding" region marked.]

Page 33:

Base composition bias

Advantages:
• Input: the crude DNA sequence
• No information on reading frames is necessary.
• No information on organism-specific codon usage is needed.

Disadvantages:
• Short exons (<200 bp) are ignored.
• Frameshift errors reduce the prediction success.

Page 34:

Codon usage bias

• The frequency of usage of each codon (per thousand) in human coding regions.

• The relative frequency of each codon among synonymous codons.

Page 35:

The human codon usage table

(http://www.kazusa.or.jp/codon/)

Page 36:

Codon usage bias

Leucine : Alanine : Tryptophan
• Protein-encoding DNA = 6.9 : 6.5 : 1
• Random DNA = 6.0 : 4.0 : 1
(Species specific; example: rat)

Most amino acids are encoded by more than one codon.

Leucine   TTG    TTA    CTG    CTA    CTT    CTC
human     12.5   7.2    40.2   6.9    12.7   19.4
rat       12.4   5.0    40.8   7.0    11.2   20.4
xenopus   14.4   9.1    26.1   8.4    15.9   12.6
yeast     27.1   26.4   10.4   13.4   12.2   5.4

Frequency depends on the species and on the level of gene expression.

[Figure: panels labelled "Frequency of usage" and "Relative frequency".]
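
The step from frequency of usage to relative frequency among synonymous codons can be sketched as follows. The numbers are the leucine values from the slide; treating them as per-thousand usage frequencies is an assumption, and the function names are illustrative.

```python
LEUCINE_CODONS = ["TTG", "TTA", "CTG", "CTA", "CTT", "CTC"]

# Leucine codon usage copied from the slide (assumed to be per thousand codons).
USAGE = {
    "human":   [12.5, 7.2, 40.2, 6.9, 12.7, 19.4],
    "rat":     [12.4, 5.0, 40.8, 7.0, 11.2, 20.4],
    "xenopus": [14.4, 9.1, 26.1, 8.4, 15.9, 12.6],
    "yeast":   [27.1, 26.4, 10.4, 13.4, 12.2, 5.4],
}


def relative_frequencies(species):
    """Share of each leucine codon among the six synonymous codons."""
    values = USAGE[species]
    total = sum(values)
    return {codon: v / total for codon, v in zip(LEUCINE_CODONS, values)}


def preferred_codon(species):
    """Codon with the highest relative frequency for the given species."""
    rel = relative_frequencies(species)
    return max(rel, key=rel.get)


if __name__ == "__main__":
    for species in USAGE:
        rel = relative_frequencies(species)
        best = preferred_codon(species)
        print(f"{species}: prefers {best} ({rel[best]:.0%} of leucine codons)")
```

The contrast between the vertebrate rows and yeast shows why a codon frequency table has to match the organism being analysed.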

Page 37:

Codon usage bias

Advantages:
• Input: the crude DNA sequence AND a codon frequency table
• No information on reading frame needed

Disadvantages:
• Weakly expressed genes have little bias
• Frameshift errors reduce the prediction success

Page 38:

Analysis of Sequence Signals

Content sensors (large sequence motifs):
• Base composition
• Codon usage
• Hexamer usage

Signal sensors (short sequence motifs):
• Start/stop codons
• Splicing signals (3' and 5' signals, branchpoint, splice junctions)
• Polyadenylation signals
• Transcription regulation signals (TF binding sites, promoters)

Page 39:

String matching

Input: a text string t of length n and a pattern string p of length m.

Output: all instances of the pattern in the text.
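
A naive matcher that follows this input/output definition directly is sketched below. The function name `find_all` and the DNA fragment used in the demo are illustrative; the matcher compares the pattern against every window of the text and reports overlapping hits too.

```python
def find_all(text, pattern):
    """Return the 0-based start index of every occurrence of pattern in text.

    Naive O(n * m) scan: each of the n - m + 1 windows is compared
    with the pattern, so overlapping occurrences are reported as well.
    """
    n, m = len(text), len(pattern)
    return [i for i in range(n - m + 1) if text[i:i + m] == pattern]


if __name__ == "__main__":
    fragment = "ATGATAGATATACAGATTATATAGATCGAT"
    print(find_all(fragment, "TATA"))
```

A four-letter pattern is expected by chance about once every 256 bases in random DNA, which is the source of the many false positives that plain pattern matching produces.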

Page 40:

Patterns

• Use a consensus sequence (pattern) for a splice site, Kozak sequence or transcription factor binding site.

• Disadvantage: many false positives.

Example: the TATA-box pattern TATA in the sequence

...ATGATAGATATACAGATTATATAGATCGAT...

Page 41:

• Start codon: GCCACCAUGG (Kozak sequence)

• Stop codons: UGA, UAA, UAG

• Polyadenylation signals: YGUGUUYY (N)20-30 AAUAAA

• Termination sequences: not well defined in eukaryotes

Page 42:

Splice Sites

• 5' splice site (5): CAG/GTAAGTAG

• 3' splice site (3): (T)10NCAG/G(C)9

• Branchpoint (B): CT(G/A)A(C/T)

• Splice junction (J): MAG/G

[Figure: gene diagram with the 5' splice sites (5), 3' splice sites (3), branchpoints (B) and splice junctions (J) marked.]

Page 43:

Profile or Position Weight Matrix

• Replace the pattern by a profile
• Employ training sets to build the profile and to optimize the algorithm.

Alignment

position   1234567...
           ACATTAA...
           TCAGAAT...
           ACAGAAC...
           AGATTAC...
           ACCGAAC...

Profile

position   1 2 3 4 5 6 7 ...
A          4 0 4 0 3 5 1 ...
C          0 4 1 0 0 0 3 ...
G          0 1 0 3 0 0 0 ...
T          1 0 0 2 2 0 1 ...
consensus  A C A G A A C ...
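
Building the count profile and its consensus from the five aligned sequences can be sketched as follows. The function names are illustrative, and the sketch assumes an ungapped alignment over the alphabet ACGT with sequences of equal length.

```python
def build_profile(alignment, alphabet="ACGT"):
    """Count how often each base occurs in each alignment column."""
    length = len(alignment[0])
    profile = {base: [0] * length for base in alphabet}
    for seq in alignment:
        for pos, base in enumerate(seq):
            profile[base][pos] += 1
    return profile


def consensus(profile):
    """Pick the most frequent base in every column."""
    length = len(next(iter(profile.values())))
    return "".join(
        max(profile, key=lambda base: profile[base][pos])
        for pos in range(length)
    )


if __name__ == "__main__":
    alignment = ["ACATTAA", "TCAGAAT", "ACAGAAC", "AGATTAC", "ACCGAAC"]
    profile = build_profile(alignment)
    for base, counts in profile.items():
        print(base, counts)
    print("consensus", consensus(profile))
```

Unlike a fixed pattern, the profile keeps the minority bases in each column, so a candidate site that differs from the consensus at a variable position is penalised less than one that differs at a conserved position.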

Page 44:

Three Basic Strategies for Gene Prediction

• Homology searching

• Analysis of sequence signals

• Statistical analysis

Page 45:

Gribskov Profiles

What is a Gribskov Profile?

A Gribskov profile is a weight matrix of the probabilities of appearance of amino acids at each position of a multiple alignment.

[Table: example profile. Rows are the alignment positions (POS 1 to 6, ...); columns are the amino acids (A, C, D, E, F, G, H, ..., L, S, T, Y) followed by a Gap column; each cell holds the score for finding that amino acid at that position.]

Page 46:

Gribskov Profiles

Differences between Gribskov Profiles and common sequence comparison methods

• A group of related sequences can be used to build the profile

• The profile includes position-specific penalties for insertion and deletion

Page 47:

Gribskov Profiles

What is needed to create a Gribskov Profile?

• A group of functionally related proteins (e.g. globins, immunoglobulins)

• Aligned by similarity or by three-dimensional structure:

           1
seq1.pep   ~CCGTL
seq2.pep   GCGSL~
seq3.pep   ~CGHSV
seq4.pep   ~CGGTL
seq5.pep   CCGSS~

• A mutational distance matrix (Blosum62, PAM250 Dayhoff)

Page 48:

Gribskov Profiles

Sequence position-specific scoring matrix M(p,a)

           1
seq1.pep   ~CCGTL
seq2.pep   GCGSL~
seq3.pep   ~CGHSV
seq4.pep   ~CGGTL
seq5.pep   CCGSS~

• Rows: the aligned positions 1, 2, 3, 4, ..., N, where N is the number of positions in the alignment

• 21 columns: 20 of them (A, C, D, E, ..., W, Y) specify the score of each amino acid at a certain position; 1 (Gap) specifies the penalty for a deletion or insertion at that position

Page 49:

Creating a Gribskov Profile

The profile is filled using the multiple alignment and the mutational distance matrix:

$M(p,a) = \sum_{b=1}^{20} W(p,b) \cdot Y(a,b)$

• $W(p,b) = n(b,p) / N_R$: weight of appearance of amino acid b at position p
• $n(b,p)$: number of times that amino acid b appears at position p
• $N_R$: number of rows in the alignment
• $Y(a,b)$: value in the mutational distance matrix

For example, the term that tryptophan (W) contributes to the score of cysteine (C) at position p, $M(p,C)$, is $W(p,W) \cdot Y(C,W)$.
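
One row of the profile can be computed directly from the formula M(p,a) = sum over b of W(p,b) * Y(a,b), as in the sketch below. To keep it short, the alphabet is restricted to four residues (A, C, G, W), which is an assumption made for illustration only; their pairwise scores are the Blosum62 values. Absent residues receive the small weight 0.025/NR, and the function names are hypothetical.

```python
ALPHABET = "ACGW"   # reduced alphabet, for illustration only
PSEUDO_COUNT = 0.025

# Blosum62 scores for the residue pairs used here (symmetric matrix).
BLOSUM62 = {
    ("A", "A"): 4, ("A", "C"): 0, ("A", "G"): 0, ("A", "W"): -3,
    ("C", "C"): 9, ("C", "G"): -3, ("C", "W"): -2,
    ("G", "G"): 6, ("G", "W"): -2,
    ("W", "W"): 11,
}


def y(a, b):
    """Look up Y(a, b) regardless of the order of the pair."""
    return BLOSUM62[(a, b)] if (a, b) in BLOSUM62 else BLOSUM62[(b, a)]


def weights(column):
    """W(p, b) = n(b, p) / NR, with 0.025 / NR for absent residues."""
    n_rows = len(column)
    result = {}
    for b in ALPHABET:
        n = column.count(b)
        result[b] = (n if n > 0 else PSEUDO_COUNT) / n_rows
    return result


def profile_row(column):
    """M(p, a) for every residue a, given one alignment column."""
    w = weights(column)
    return {a: sum(w[b] * y(a, b) for b in ALPHABET) for a in ALPHABET}


if __name__ == "__main__":
    # Column with one C and four G, e.g. C, G, G, G, G read down five rows.
    for residue, score in profile_row("CGGGG").items():
        print(residue, round(score, 3))
```

Because every score is a weighted sum over the substitution matrix, a residue that never appears in the column can still score well if it is similar to the residues that do appear.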

Page 50:

Mutational Distance Matrix

Blosum62 matrix

     A   B   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   X   Y   Z
A    4
B   -2   6
C    0  -3   9
D   -2   6  -3   6
E   -1   2  -4   2   5
F   -2  -3  -2  -3  -3   6
G    0  -1  -3  -1  -2  -3   6
H   -2  -1  -3  -1   0  -1  -2   8
I   -1  -3  -1  -3  -3   0  -4  -3   4
K   -1  -1  -3  -1   1  -3  -2  -1  -3   5
L   -1  -4  -1  -4  -3   0  -4  -3   2  -2   4
M   -1  -3  -1  -3  -2   0  -3  -2   1  -1   2   5
N   -2   1  -3   1   0  -3   0   1  -3   0  -3  -2   6
P   -1  -1  -3  -1  -1  -4  -2  -2  -3  -1  -3  -2  -2   7
Q   -1   0  -3   0   2  -3  -2   0  -3   1  -2   0   0  -1   5
R   -1  -2  -3  -2   0  -3  -2   0  -3   2  -2  -1   0  -2   1   5
S    1   0  -1   0   0  -2   0  -1  -2   0  -2  -1   1  -1   0  -1   4
T    0  -1  -1  -1  -1  -2  -2  -2  -1  -1  -1  -1   0  -1  -1  -1   1   5
V    0  -3  -1  -3  -2  -1  -3  -3   3  -2   1   1  -3  -2  -2  -3  -2   0   4
W   -3  -4  -2  -4  -3   1  -2  -2  -3  -3  -2  -1  -4  -4  -2  -3  -3  -2  -3  11
X   -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1
Y   -2  -3  -2  -3  -2   3  -3   2  -1  -2  -1  -1  -2  -3  -1  -2  -2  -2  -1   2  -1   7
Z   -1   2  -4   2   5  -3  -2   0  -3   1  -3  -2   0  -1   2   0   0  -1  -2  -3  -1  -2   5

Example: row W, column C gives -2.

Page 51:

Creating a Gribskov Profile

The profile is filled using the multiple alignment and the mutational distance matrix:

$M(p,a) = \sum_{b=1}^{20} W(p,b) \cdot Y(a,b)$

• $W(p,b) = n(b,p) / N_R$: weight of appearance of amino acid b at position p
• $n(b,p)$: number of times that amino acid b appears at position p
• $N_R$: number of rows in the alignment
• $Y(a,b)$: value in the mutational distance matrix

For example, the term that tryptophan (W) contributes to $M(p,C)$ is $W(p,W) \cdot Y(C,W)$, with $Y(C,W) = -2$.

Page 52:

Creating a Gribskov Profile

Alignment

           1
seq1.pep   ~CCGTL
seq2.pep   GCGSL~
seq3.pep   ~CGHSV
seq4.pep   ~CGGTL
seq5.pep   CCGSS~

$M(p,a) = \sum_{b=1}^{20} W(p,b) \cdot Y(a,b)$

$M(1,A) = \sum_{b=1}^{20} W(1,b) \cdot Y(A,b)$

$M(1,A) = (W(1,A) \cdot Y(A,A)) + (W(1,C) \cdot Y(A,C)) + \dots + (W(1,Y) \cdot Y(A,Y))$

$M(1,A) = (0.025/6 \cdot 4) + (1/6 \cdot 0) + \dots + (0.025/6 \cdot (-2))$

$M(1,C) = \sum_{b=1}^{20} W(1,b) \cdot Y(C,b)$

Amino acids not present at a position get a very small weight, $0.025 / N_R$.

[Table: resulting profile. Rows are the alignment positions (POS 1 to 6); columns are the amino acids (A, C, D, E, F, G, H, ..., L, S, T, Y) followed by a Gap column; each cell holds the score for finding that amino acid at that position.]

Consensus sequence: the symbol with the largest value in each position (CCGGTL)

Page 53:

Scoring with a Gribskov Profile

Alignment

           1
seq1.pep   ~CCGTL
seq2.pep   GCGSL~
seq3.pep   ~CGHSV
seq4.pep   ~CGGTL
seq5.pep   CCGSS~

[Table: profile for this alignment. Rows are the alignment positions (POS 1 to 6); columns are the amino acids followed by a Gap column; each cell holds the score for finding that amino acid at that position.]

Consensus sequence: the symbol with the largest value in each position (CCGGTL)

$P(\mathrm{CCGGTL}) = P_{p1}(C) \cdot P_{p2}(C) \cdot P_{p3}(G) \cdot P_{p4}(G) \cdot P_{p5}(T) \cdot P_{p6}(L)$

$\log P(\mathrm{CCGGTL}) = \log P_{p1}(C) + \log P_{p2}(C) + \log P_{p3}(G) + \log P_{p4}(G) + \log P_{p5}(T) + \log P_{p6}(L)$

The probability of any sequence is calculated in the same way.
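
Scoring a sequence as a sum of per-position log probabilities can be sketched as follows. The probability table is hypothetical: its values are invented for illustration and are not the entries of the profile above, and the function name is likewise an assumption.

```python
import math

# Hypothetical per-position residue probabilities for six positions.
POSITION_PROBS = [
    {"C": 0.9, "G": 0.1},
    {"C": 0.95, "G": 0.05},
    {"G": 0.8, "C": 0.2},
    {"G": 0.4, "S": 0.4, "H": 0.2},
    {"T": 0.4, "S": 0.6},
    {"L": 0.6, "V": 0.2, "S": 0.2},
]


def log_probability(seq, position_probs):
    """Sum of log P_p(residue) over all positions of seq.

    Raises KeyError for a residue that has no probability at a
    position; a real profile would assign it a small pseudo-weight.
    """
    if len(seq) != len(position_probs):
        raise ValueError("sequence and profile lengths differ")
    return sum(
        math.log(position_probs[pos][residue])
        for pos, residue in enumerate(seq)
    )


if __name__ == "__main__":
    score = log_probability("CCGGTL", POSITION_PROBS)
    print("log P =", score)
    print("P     =", math.exp(score))
```

Working with logarithms turns the product into a sum, which avoids numerical underflow when many small probabilities are multiplied for long sequences.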

Page 54:

Introduction

Gribskov Profile
• Definition
• Creating a Gribskov Profile
• Scoring a sequence with a Profile

Hidden Markov Models
• Definition
• Basic Architecture
• State order of an HMM
• Scoring a sequence with an HMM
• Biological application of HMMs

Building a Hidden Markov Model
• Estimation of the model
• Problems building an HMM

HMM programs in HUSAR

Page 55:

Advantages of using Markov Models

• Markov Models are probabilistic models with a solid statistical foundation.

• In contrast to patterns and profiles, HMMs allow consistent treatment of insertions and deletions.

• In contrast to patterns and profiles, Markov Models take into account the information about neighboring residues.

[Figure: model fragment with the symbols C, G, C, -, T, A and the probabilities P=0.6, P=0.1, P=0.2, P=0.09, P=0.01.]

Page 56:

Hidden Markov Model

The Markov Model is based on active domains only!

Position                           123456…
Domain 1 (active binding site)     ATGTCGTCGTCG
Domain 2 (never found, inactive)   ATGTGGTCGTCG
Domain 3 (never found, inactive)   ATGTCATCGTCG
Domain 4 (active)                  ATGTGATCGTCG

• If a G is found at position 3, P(T)4 = 1.0
• If a T is found at position 4, P(C)5 = 0.5, P(G)5 = 0.5
• If a C is found at position 5, P(A)6 = 0.0, P(G)6 = 1.0
• If a G is found at position 5, P(A)6 = 1.0, P(G)6 = 0.0
• If an A is found at position 6, P(T)7 = 1.0
• If a G is found at position 6, P(T)7 = 1.0
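
The conditional probabilities listed for this example can be estimated from the two active domains with a position-specific first-order Markov chain, sketched below. This is a plain Markov chain rather than a full HMM with hidden states, and the function names are illustrative.

```python
from collections import Counter, defaultdict


def train(sequences):
    """Estimate start counts and position-specific transition counts.

    transitions[(i, prev)] counts the residues seen at position i
    (0-based) given that prev was seen at position i - 1.
    """
    start = Counter(seq[0] for seq in sequences)
    transitions = defaultdict(Counter)
    for seq in sequences:
        for i in range(1, len(seq)):
            transitions[(i, seq[i - 1])][seq[i]] += 1
    return start, transitions


def probability(seq, start, transitions):
    """Probability of seq under the trained chain (0.0 if unseen)."""
    p = start[seq[0]] / sum(start.values())
    for i in range(1, len(seq)):
        counts = transitions.get((i, seq[i - 1]))
        if not counts:
            return 0.0  # this context never occurred in training
        p *= counts[seq[i]] / sum(counts.values())
    return p


if __name__ == "__main__":
    active = ["ATGTCGTCGTCG", "ATGTGATCGTCG"]      # domains 1 and 4
    start, transitions = train(active)
    domains = {
        "Domain 1": "ATGTCGTCGTCG",
        "Domain 2": "ATGTGGTCGTCG",
        "Domain 3": "ATGTCATCGTCG",
        "Domain 4": "ATGTGATCGTCG",
    }
    for name, seq in domains.items():
        print(name, probability(seq, start, transitions))
```

A profile built from the same two sequences would treat positions 5 and 6 independently and could not tell the inactive domains 2 and 3 from the active ones, whereas the chain assigns them probability zero because it conditions position 6 on position 5.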

Page 57: Estructura de gene procariota Promoter CDS Terminator transcription Genomic DNA mRNA protein UTR translation

Order state of HMMs

First order Markov Model

Captures the first order correlation between neighboring nucleotides.

Fifth order Markov Model

HMMs can use preceding, succeeding or surrounding residues.

In principle there is no limit on the number of preceding residues an HMM can use; in practice, computing time is the constraint.

Markov Models take into account additional information about neighboring residues.
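As a rough illustration of what "order" means here, the sketch below estimates a k-th order Markov chain by counting which nucleotide follows each context of k preceding nucleotides. The function names and the toy sequence are assumptions made for the example. The number of possible contexts grows as 4^k, which is one way to see why computing time (and training data) becomes the practical limit.

```python
from collections import Counter, defaultdict


def train_markov(seqs, k):
    """Estimate P(next nucleotide | previous k nucleotides) by counting."""
    counts = defaultdict(Counter)
    for s in seqs:
        for i in range(k, len(s)):
            counts[s[i - k:i]][s[i]] += 1
    return {ctx: {b: n / sum(c.values()) for b, n in c.items()}
            for ctx, c in counts.items()}


def n_contexts(k, alphabet_size=4):
    """Number of distinct k-nucleotide contexts a k-th order model can condition on."""
    return alphabet_size ** k


first_order = train_markov(["ACGACG"], k=1)   # toy sequence, not from the slides
print(first_order)
print(n_contexts(1), n_contexts(5))
```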

Page 58: Estructura de gene procariota Promoter CDS Terminator transcription Genomic DNA mRNA protein UTR translation

Biological applications of HMMs

Radiation hybrid mapping (Slonim et al., 1997)
Genetic linkage mapping (Kruglyak et al., 1996)
Phylogenetic analysis (Felsenstein & Churchill, 1996)
Protein homology recognition (Karplus et al., 1999)
Protein secondary structure prediction (Goldman et al., 1996)
Profile HMM libraries (PROSITE; Pfam database)
Gene finding (Birney & Durbin, 1997; Henderson, 1997; Krogh, 1997; Lukashin & Borodovsky, 1998)

Page 59: Estructura de gene procariota Promoter CDS Terminator transcription Genomic DNA mRNA protein UTR translation

Hidden Markov Model

[Figure: an HMM with states A and B (indexed 1 and 2) and an End state. Transitions: t_{1,1}, t_{1,2}, t_{2,2}, t_{2,end}. Emission probabilities: P_1(a), P_1(b) for state 1 and P_2(a), P_2(b) for state 2. Observed symbol sequence: a b a.]

P(aba | HMM) = P_1(a) · t_{1,1} · P_1(b) · t_{1,2} · P_2(a) · t_{2,end}
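The product on this slide follows one particular state path (state 1, state 1, state 2, End). A minimal sketch of that calculation is given below; all numeric values are hypothetical, since the slide gives only symbols, and the model is assumed to start in state 1. Because the diagram also has a t 2,2 loop, a second path (1, 2, 2) can emit the same symbols, so the sketch also sums over all paths that start in state 1.

```python
from itertools import product

# Hypothetical parameters: the slide names these quantities but gives no values.
EMIT = {1: {"a": 0.7, "b": 0.3},          # P1(a), P1(b)
        2: {"a": 0.4, "b": 0.6}}          # P2(a), P2(b)
TRANS = {1: {1: 0.5, 2: 0.5},             # t 1,1 and t 1,2
         2: {2: 0.8, "end": 0.2}}         # t 2,2 and t 2,end


def joint(symbols, path):
    """P(symbols, path): emissions and transitions multiplied along one state path."""
    p = 1.0
    for i, (sym, state) in enumerate(zip(symbols, path)):
        nxt = path[i + 1] if i + 1 < len(path) else "end"
        p *= EMIT[state][sym] * TRANS[state].get(nxt, 0.0)
    return p


def total(symbols, start_state=1):
    """Sum of joint() over every state path that begins in start_state."""
    paths = product(EMIT, repeat=len(symbols))
    return sum(joint(symbols, p) for p in paths if p[0] == start_state)


print(joint("aba", (1, 1, 2)))   # the path written out on the slide
print(total("aba"))              # all paths, including (1, 2, 2)
```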

Page 60: Estructura de gene procariota Promoter CDS Terminator transcription Genomic DNA mRNA protein UTR translation

Hidden Markov Model

[Figure: HMM with states Start, E, 5, I and End, annotated with states and transitions.]

Transitions: Start → E 1.0; E → E 0.9; E → 5 0.1; 5 → I 1.0; I → I 0.9; I → End 0.1

Emission probabilities:
E: PA=0.25 PC=0.25 PG=0.25 PT=0.25
5: PA=0.05 PC=0 PG=0.95 PT=0
I: PA=0.4 PC=0.1 PG=0.1 PT=0.4
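To make the diagram concrete, here is a sketch that encodes the model and finds the most probable state path with the Viterbi algorithm. The assignment of the numbers to states and transitions is read off the extracted slide text and is an assumption: E emits uniformly, state 5 emits mostly G, I is A/T-rich, with Start → E 1.0, E → E 0.9, E → 5 0.1, 5 → I 1.0, I → I 0.9 and I → End 0.1. The test sequence is a toy example.

```python
import math

STATES = ["E", "5", "I"]
START = {"E": 1.0}                                   # Start -> E
TRANS = {"E": {"E": 0.9, "5": 0.1},
         "5": {"I": 1.0},
         "I": {"I": 0.9, "End": 0.1}}
EMIT = {"E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
        "5": {"A": 0.05, "C": 0.0, "G": 0.95, "T": 0.0},
        "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}


def safe_log(p):
    """Natural log that maps probability 0 to -infinity instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(seq):
    """Return (best state path, its log probability) for a non-empty sequence."""
    v = {s: safe_log(START.get(s, 0.0)) + safe_log(EMIT[s][seq[0]]) for s in STATES}
    back = []
    for sym in seq[1:]:
        new_v, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: v[r] + safe_log(TRANS[r].get(s, 0.0)))
            new_v[s] = (v[prev] + safe_log(TRANS[prev].get(s, 0.0))
                        + safe_log(EMIT[s][sym]))
            ptr[s] = prev
        back.append(ptr)
        v = new_v
    # The path must finish with a transition into End.
    last = max(STATES, key=lambda s: v[s] + safe_log(TRANS[s].get("End", 0.0)))
    best = v[last] + safe_log(TRANS[last].get("End", 0.0))
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, best


path, logp = viterbi("CGA")
print(path, math.exp(logp))
```

For the three-letter sequence CGA only one path can reach End (E, then 5, then I), so the Viterbi result is also the total probability of the sequence under these assumed parameters.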

Page 61: Estructura de gene procariota Promoter CDS Terminator transcription Genomic DNA mRNA protein UTR translation

Hidden Markov Model

Markov Models assume that sequences are generated independently of the model

Applied to time series or to linear sequences

Page 62: Estructura de gene procariota Promoter CDS Terminator transcription Genomic DNA mRNA protein UTR translation

Basic Architecture of a profile HMM

[Figure: profile HMM running from Start to End through match states m1, m2, m3 (labelled C, C and Y), delete states d1, d2, d3 and insert states i0, i1, i2, i3. Each match state carries a probability distribution over the amino acids A, C, D, E, F, G, H, I, ..., Y; example probabilities shown: 0.3, 0.015, 0.06, 0.01, 0.5, 0.01.]

Match states model the distribution of symbols in the corresponding column of an alignment.

Probabilities are derived from information contained in the alignment.
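One simple way to obtain match-state emission probabilities from an alignment is to count residues per column. The sketch below does that with an add-one pseudocount so that unseen amino acids keep a non-zero probability; the pseudocount, the function name and the toy alignment are assumptions for illustration, and every column is treated as a match column.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def match_emissions(alignment, pseudocount=1.0):
    """Per-column emission distributions for the match states of a profile."""
    profile = []
    for col in range(len(alignment[0])):
        # Gaps are skipped; they would be handled by delete states instead.
        counts = Counter(row[col] for row in alignment if row[col] != "-")
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({aa: (counts[aa] + pseudocount) / total
                        for aa in AMINO_ACIDS})
    return profile


# Toy alignment with three columns, echoing the C, C, Y match states in the figure.
alignment = ["CCY", "CCY", "CAY", "CCF"]
profile = match_emissions(alignment)
print(round(profile[0]["C"], 3), round(profile[2]["Y"], 3))
```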

Page 63: Estructura de gene procariota Promoter CDS Terminator transcription Genomic DNA mRNA protein UTR translation

Methods in Gene Prediction

Ab initio analysis of genomic sequences:

Genscan (Burge and Karlin 1997)
HMMer (Haussler et al. 1993, Krogh et al. 1994)
FGenesH (Solovyev and Salamov 1994)

Comparison of protein and genomic sequences:

Procrustes (Gelfand et al. 1996)
Genewise (Birney and Durbin)

Cross-species genomic sequence comparisons:

CEM (Bafna and Huson 2000)
TWINSCAN (Korf et al. 2001)
Doublescan (Meyer and Durbin 2002)
SLAM (Alexandersson et al. 2003)

Page 64: Estructura de gene procariota Promoter CDS Terminator transcription Genomic DNA mRNA protein UTR translation

Gene prediction programs (many with homology searching capabilities)

GeneMachine http://genome.nhgri.nih.gov/genemachine

Genscan http://genome.dkfz-heidelberg.de

GenomeScan http://genes.mit.edu/genomescan

Fgenesh, Fgenes-M, TSSW, TSSG, Polyah, SPL and RNAS http://genomic.sanger.ac.uk/gf/gf.shtml

Fgenesh, Fgenes-M, SPL and RNASPL http://www.softberry.com/berry.phtml

HMMgene http://www.cbs.dtu.dk/services/HMMgene

Genie http://www.fruitfly.org/seq_tools/genie.html

GeneMark http://www.ebi.ac.uk/genemark

GeneID http://www1.imim.es/software/geneid/geneid.html#top

GeneParser http://beagle.colorado.edu/~eesnyder/GeneParser.html

MZEF and POMBE http://argon.cshl.org/genefinder/

AAT, MZEF with homology http://genome.cs.mtu.edu/aat.html

MZEF with SpliceProximalCheck http://industry.ebi.ac.uk/~thanaraj/MZEF-SPC.html

Genesplicer, Glimmer and GlimmerM http://www.tigr.org/~salzberg

WebGene http://www.itba.mi.cnr.it/webgene

GenLang http://www.cbil.upenn.edu/genlang/genlang_home.html

Xpound ftp://igs-server.cnrs-mrs.fr/pub/Banbury/xpound

Gene-prediction programs: alignment based

Procrustes http://www-hto.usc.edu/software/procrustes/index.hl

GeneWise2 http://www.sanger.ac.uk/Software/Wise2

SplicePredictor http://bioinformatics.iastate.edu/cgi-bin/sp.cgi

PredictGenes http://cbrg.inf.ethz.ch/Server/subsection3_1_8.html

Page 65: Estructura de gene procariota Promoter CDS Terminator transcription Genomic DNA mRNA protein UTR translation

Gene-prediction programs: comparative genomics

Doublescan http://www.sanger.ac.uk/Software/analysis/doublescan

SLAM http://bio.math.berkeley.edu/slam

Twinscan http://genes.cs.wustl.edu

Finding ORFs and splice sites

DioGenes http://www.cbc.umn.edu/diogenes/index.html

OrfFinder http://www.ncbi.nlm.nih.gov/gorf/gorf.html

YeastGene http://tubic.tju.edu.cn/cgi-bin/Yeastgene.cgi

CDS: search coding regions http://bioweb.pasteur.fr/seqanal/interfaces/cds-simple.html

Neural network splice site prediction http://www.fruitfly.org/seq_tools/splice.html

NetGene2 http://www.cbs.dtu.dk/services/NetGene2

RNA gene prediction

tRNAScan http://www.genetics.wustl.edu/eddy/tRNAscan-SE/

Page 66: Estructura de gene procariota Promoter CDS Terminator transcription Genomic DNA mRNA protein UTR translation

FGENES, FGENEH, FGENESH(+)

• Victor Solovyev and colleagues
• FGENE applications are based on HMMs
• They form a complete, partially automated, modular package
• Dynamic modelling with various features of coding sequences
• Precise determination of exon borders with homology search

Page 67: Estructura de gene procariota Promoter CDS Terminator transcription Genomic DNA mRNA protein UTR translation

[Figure: gene structure diagram with positions 1066, 1345, 2427 and 3058 marked, and a TATA signal at 1011]

Page 68: Estructura de gene procariota Promoter CDS Terminator transcription Genomic DNA mRNA protein UTR translation

GENIE (UCLA)

• Combination of statistical methods (HMM) and neural networks
• A candidate sequence is "threaded" through the HMM using a min-cost path search algorithm, and the system reports this "optimal" path as the predicted gene structure.

Page 69: Estructura de gene procariota Promoter CDS Terminator transcription Genomic DNA mRNA protein UTR translation

[Figure: gene structure diagram with positions 1066, 1345, 2427 and 3058 marked, and a TATA signal at 1011]

Page 70: Estructura de gene procariota Promoter CDS Terminator transcription Genomic DNA mRNA protein UTR translation

GrailEXP

• Widely used for GenBank annotations
• GrailEXP predicts exons, genes, promoters, polyAs, CpG islands, EST similarities, and repetitive elements

Page 71: Estructura de gene procariota Promoter CDS Terminator transcription Genomic DNA mRNA protein UTR translation

[Figure: gene structure diagram with positions 1066, 1345, 2427 and 3058 marked, and a TATA signal at 1011]

Page 72: Estructura de gene procariota Promoter CDS Terminator transcription Genomic DNA mRNA protein UTR translation

Genscan (Chris Burge and Samuel Karlin)

• Genscan employs a dynamic programming strategy.
• General three-periodic (inhomogeneous) fifth order Markov Model.
• Transcription, translation and splicing signals.
• Length distributions and compositional features of introns, exons and intergenic regions.
• Exceptional: it was developed to recognize partial and multiple genes on both strands.
• Independent of databases.

Page 73: Estructura de gene procariota Promoter CDS Terminator transcription Genomic DNA mRNA protein UTR translation

[Figure: gene structure diagram with positions 1066, 1345, 2427 and 3058 marked, and a TATA signal at 1011]

Page 74: Estructura de gene procariota Promoter CDS Terminator transcription Genomic DNA mRNA protein UTR translation

TWINSCAN (I. Korf et al., 2001)

• TWINSCAN models both gene structure and evolutionary conservation
• Scores of features (e.g. splice sites, coding regions) are modified using the patterns of divergence between the target genome and a closely related genome.

Page 75: Estructura de gene procariota Promoter CDS Terminator transcription Genomic DNA mRNA protein UTR translation

Prediction of a subsequence of the mouse genome

[Figure: tracks for TWINSCAN, GENSCAN, the actual gene structure, alignments to human genomic sequences, and reported repeat sequences]

Page 76: Estructura de gene procariota Promoter CDS Terminator transcription Genomic DNA mRNA protein UTR translation

[Figure: gene structure diagram with positions 1066, 1345, 2427 and 3058 marked and a TATA signal at 1011, shown together with the cDNA and the protein]

Page 77: Estructura de gene procariota Promoter CDS Terminator transcription Genomic DNA mRNA protein UTR translation

GLIMMER (Salzberg and colleagues, JHU)

• Finds genes in microbial DNA.
• Combination of Markov models from first through eighth order, weighting each model according to its predictive power.
• Widely used for GenBank annotations.

Page 78: Estructura de gene procariota Promoter CDS Terminator transcription Genomic DNA mRNA protein UTR translation

How (not) to use bioinformatics tools

• No single bioinformatics tool is 100% accurate (colleagues and developers may tell you the opposite).

• Common pitfall: for which organism was the application developed?

• Repetitive elements (such as the mouse L1 element) can be wrongly recognized as genes.

• Bioinformatics rule: try several approaches, and try to understand why they may give apparently contradictory results.

Page 79: Estructura de gene procariota Promoter CDS Terminator transcription Genomic DNA mRNA protein UTR translation

Evaluation of Gene Prediction Tools

[Figure: output of a gene prediction tool compared with the true genes]

The ideal test set is a segment of DNA for which all genes have been described experimentally.

Specificity = true predicted / all predicted (a measure of false positives): 9 / 11 = 81.8%

Sensitivity = true predicted / true genes (a measure of false negatives): 9 / 10 = 90%
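The two ratios can be checked with a few lines of code. This is only a sketch of the arithmetic using the definitions and counts given on the slide (9 correctly predicted genes, 11 predicted in total, 10 true genes); the function names are illustrative.

```python
def specificity(true_predicted, all_predicted):
    """Fraction of predictions that are real genes (penalises false positives)."""
    return true_predicted / all_predicted


def sensitivity(true_predicted, true_genes):
    """Fraction of real genes that were predicted (penalises false negatives)."""
    return true_predicted / true_genes


print(f"Specificity: {specificity(9, 11):.1%}")
print(f"Sensitivity: {sensitivity(9, 10):.1%}")
```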

Page 80: Estructura de gene procariota Promoter CDS Terminator transcription Genomic DNA mRNA protein UTR translation

Accuracy versus G+C content

[Figure: bar chart of accuracy (scale 0.00 to 1.00) for FGENES, GeneMark, Genie, Genscan, Morgan and MZEF, grouped by G+C content: 0 - 40%, 40 - 50%, 50 - 60%, 60 - 100%]

http://www.cse.ucsc.edu/~rogic/evaluation/tablesgen.html

Page 81: Estructura de gene procariota Promoter CDS Terminator transcription Genomic DNA mRNA protein UTR translation

Exon accuracy

[Figure: bar chart of exon accuracy (scale 0.00 to 1.00) for FGENES, GeneMark, Genie, Genscan, HMMgene, Morgan and MZEF. Series: sensitivity (false negatives), specificity (false positives), partially correct predicted]

http://www.cse.ucsc.edu/~rogic/evaluation/tablesgen.html

Page 82: Estructura de gene procariota Promoter CDS Terminator transcription Genomic DNA mRNA protein UTR translation

Accuracy versus exon length

[Figure: bar chart of accuracy (scale 0.00 to 1.00) for FGENES, GeneMark, Genie, Genscan, HMMgene, Morgan and MZEF, grouped by exon length: 0 - 24, 25 - 49, 50 - 74, 75 - 99, 100 - 199, 200 - 299, 300 +]

http://www.cse.ucsc.edu/~rogic/evaluation/tablesgen.html

Page 83: Estructura de gene procariota Promoter CDS Terminator transcription Genomic DNA mRNA protein UTR translation

Accuracy versus exon type

[Figure: bar chart of accuracy (scale 0.00 to 1.00) for FGENES, GeneMark, Genie, Genscan, HMMgene and Morgan, grouped by exon type: initial, internal, terminal, single]

http://www.cse.ucsc.edu/~rogic/evaluation/tablesgen.html

Page 84: Estructura de gene procariota Promoter CDS Terminator transcription Genomic DNA mRNA protein UTR translation

Open Problems and Future Directions

• About 90% of the nucleotides can be identified correctly, but the exact boundaries of the exons and their assembly into complete coding sequences are much more difficult to predict. Fewer than half of the genes are predicted exactly.

• A single gene can give rise to multiple protein products through alternative splicing, alternative transcription or alternative translation. This has not been dealt with effectively.

• Promoter recognition