
Prokaryotic gene structure

[Figure: genomic DNA with promoter, CDS and terminator. Transcription starts at +1 and yields an mRNA (5' to 3') whose coding region is flanked by UTRs; translation of the mRNA yields the protein. The slide also carries the label "Operons:".]

Prokaryotic Gene Organisation

Transcription: 2 consensus sequences and the startpoint

• -10: TATAAT (T80 A95 t45 A60 a50 T96)
• -35: TTGACA (T82 T84 G78 A65 C54 a45)

Translation: rbs (ribosomal binding site), Shine-Dalgarno sequence AGGAGG

[Figure: DNA with the promoter bound by RNA polymerase and by a repressor or activator; the mRNA (leader, spacer, trailer) bound by the ribosome.]

Promoters and regulators in prokaryotes

• The promoter determines:
  1. Which strand will serve as a template.
  2. The transcription starting point.
  3. The strength of polymerase binding.
• The RNA polymerase subunit for promoter recognition is called the sigma factor
  – Different variations (7 for E. coli)
  – Consensus binding sequences (Table 6.2 in textbook)
• Operons for co-transcription
• Regulators affect the binding of RNA polymerase to DNA (positive and negative)

Example of a prokaryotic promoter

• Pribnow box located at –10 (6-7bp)

• Promoter sequence located at -35 (6bp)

Consensus sequences

• Promoter sequences can vary tremendously.

• RNA polymerase recognizes hundreds of different promoters

Terminators

• The terminator region pauses the polymerase and causes dissociation.

• The final mRNA may represent less than 5% of the transcribed DNA sequence

Production of a mature RNA in eukaryotes

Simplified model of a human gene

[Figure: 5' regulatory region with the promoter; transcription unit made of exons 1 to n separated by introns 1, 2, 3, ..., with an untranslated sequence at each end; 3' regulatory region.]

After post-transcriptional processing of the primary transcript, the mRNA sequence corresponds to the exon sequences, including the untranslated regions (UTRs); the introns have been removed.

Eukaryotic genes normally contain introns

Types of genes in eukaryotes

Protein-encoding genes
• Transcription: RNA polymerase II dependent promoters
• Type II splicing
• Polyadenylation (exception: histone mRNAs)
• Translation

RNA-coding genes
• Transcription: RNA polymerase I and III dependent promoters
• Type I and III splicing
• No polyadenylation
• No translation

Structure of a eukaryotic gene

• TATA box located at –25
  – TATA(A/T)A(A/T)
  – Recognized by TATA-binding protein
• Initiator sequence at +1
  – YYCARR; Y is C/T, R is G/A
  – +1 is usually the A
• Transcription factors bind to promoters
  – Position-specific scoring matrix (PSSM)
• Possible distant regions acting as enhancers or silencers (even more than 50 kb away)
  – More complex mechanism than in prokaryotes

Transcription can be modulated by trans-acting factors: activators, which bind enhancers, and repressors, which bind silencers.

Alternative splicing can produce different proteins with different functions

[Figure: two splice variants of one gene; one contains domains that adhere to cell surfaces, the other lacks them.]

Eukaryotic Gene Organisation

Transcription:

• core promoter: loosely conserved initiator region (Inr) around the TSS; TATA box at ~ -25
• proximal promoter: CAAT box (CCAAT) at ~ -75; GC box at ~ -170
• enhancer/silencer: upstream or downstream of the promoter

[Figure: promoter map with the proximal promoter (GC box, CAAT box) followed by the core promoter (TATA box, Inr at the TSS).]

Translation:

• 5' Kozak sequence: GCCACCATG
• 3' polyadenylation site: AATAAA

Eukaryotic gene structure vs. prokaryotic gene structure

• No operons
• Capping at the 5' end and polyadenylation at the 3' end
  – Transport of mRNA out of the nucleus
  – Affects stability and efficiency of translation
• Introns
• Alternative splicing

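Summary

• Prokaryotic genes
[Figure: one promoter followed by several genes, each with its own start and stop, and a terminator.]

• Eukaryotic genes
[Figure: a promoter followed by exons separated by introns, with start and stop marked and each intron bounded by a donor and an acceptor site.]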

Gene prediction: Prokaryotes vs. Eukaryotes

Prokaryotes

• Conserved promoter region (-10, -35; fixed spacing)
• Contiguous open reading frames (ORF)
• Polycistronic mRNAs
• Short intergenic sequences

Good method: detecting large ORFs

• Complications:
  – Sequencing errors
  – Very small genes will be missed
  – Overlapping genes on both strands
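Detecting large ORFs can be sketched in a few lines. This is a minimal illustration, not the method of any program named in these slides; the function names, the `min_codons` threshold and the restriction to ATG as the only start codon are assumptions made for the example.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=100):
    """Return (strand, frame, start, end) for every ATG...stop ORF.

    Coordinates are 0-based, end-exclusive, and refer to the strand
    that was scanned. An ORF is reported only if it has at least
    `min_codons` codons before the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first ATG opens the ORF
                elif start is not None and codon in STOPS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None                   # look for the next ORF
    return orfs


# Toy sequence: one short ORF (ATG + 5 codons + TAA) on the forward strand.
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
toy_orfs = find_orfs(toy, min_codons=3)
```

Scanning both strands in all three frames covers genes on either strand. Very small genes fall below the length threshold, and a single sequencing error that shifts the frame splits or hides an ORF.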

Promoter and Gene prediction: Prokaryotes vs. Eukaryotes

• Promoter elements
  – core promoter: initiator region (Inr), TATA box, downstream promoter element (DPE)
  – proximal promoter: transcription factor ("TF") binding sites: CAAT box, GC box, SP-1 sites, GAGA boxes
  – enhancer/silencer sites (less useful)
• Coding sequence
  – signal sensors: start and stop signals (Kozak sequence, stop codons), polyadenylation signals, splicing signals (3' and 5' splice sites, splice junction, branchpoint)
  – content sensors: base composition, codon usage, hexamer usage

The challenge

• New data are collected ever faster, and the rate of collection exceeds the rate at which the data can be analysed.
• Whole-genome sequences for more than 800 organisms (bacteria, archaea and eukaryota, as well as many viruses and organelles) are either complete or being determined.
• Across all sequenced species, nearly half of the potential genes cannot be assigned a specific role.

http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/links2a.html

Gene prediction programs should be able to identify and annotate all genes automatically.

Three Basic Strategies for Promoter and Gene Prediction

• Homology searching

• Analysis of sequence signals

• Statistical analysis

Why homology?

Evolutionary relationships

[Figure: tree leading from a common ancestor to species 1, species 2 and species 3.]

Paralogues: homologous proteins that perform different but related functions within one organism.

Orthologues: homologous proteins that perform the same function in different species.

Homology Searching

• Investigate sequence databases such as EMBL or Swissprot with programs such as BLAST or TFASTA.
• Orthologs / homologs / paralogs may have been described. Sequence identity may be low; several approaches should be tried.
• As more sequence data is collected, this initial step becomes more important.

Low coverage, high accuracy

Three Basic Strategies for Gene Prediction


• Homology searching

• Analysis of sequence signals

• Statistical analysis




Which signals can be used in bioinformatics for gene prediction?

What distinguishes genes from other genomic sequences?

Genomic sequences tend towards randomness; genes are non-random.

Translated DNA sequences are restricted in the choice of nucleotides in the first, second and (to a lesser extent) third position of the codons.

The occurrence of a certain base in the first, second and third position of the potential codons will not be random.

    123123123123123123123123123123123123123123123123123123
    ATGATAGCTATACGGATCCGTAGCTAGATCAGTAGCGTGACTGCTGTCGTCATT

    A(1,4,7...) = 10 of 18 (random sequence: expected 25%)
    A(2,5,8...) =  1 of 18 (random sequence: expected 25%)

Confidence levels can be calculated because large sets of coding and non-coding sequences have been analyzed.
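The counts in the example can be reproduced with a short sketch. The sequence is the one shown on the slide; the function name `position_composition` is an assumption introduced for illustration.

```python
from collections import Counter

# Example sequence from the slide (18 codons, read in frame 1).
SEQ = "ATGATAGCTATACGGATCCGTAGCTAGATCAGTAGCGTGACTGCTGTCGTCATT"


def position_composition(seq):
    """Count each base separately at codon positions 1, 2 and 3."""
    usable = len(seq) - len(seq) % 3          # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    return [Counter(codon[pos] for codon in codons) for pos in range(3)]


composition = position_composition(SEQ)
n_codons = len(SEQ) // 3
for pos, counts in enumerate(composition, start=1):
    print(f"A at codon position {pos}: {counts['A']} of {n_codons}")
```

In a random sequence each base would be expected at about 25% of every position; here A fills 10 of 18 first positions but only 1 of 18 second positions.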

Base composition bias

[Figure: frequency of the four different nucleotides at the different codon positions in human coding regions.]

Our model gene

Growth factor, mouse

• Weakly expressed, tissue specific
• GC-rich (57%; CDS 66%)
• 'TATA' promoter (1011-1017)
• 2 exons

Not an easily predictable gene!

[Figure: map of the model gene with the TATA box at 1011 and positions 1066, 1345, 2427 and 3058 marked.]

Bottner M, Laaff M, Schechinger B, Rappold G, Unsicker K, Suter-Crazzolara C. Gene. 1999 (237):105-11.

Testcode

'Period three constraint' [J. Fickett, Nucl. Acids Res. 10(17); 5303-5318 (1982)]. The top and bottom regions of the plot predict coding and non-coding regions at a 95% confidence level. Start and stop codons (dashes and diamonds) are indicated.

[Figure: Testcode plot of the model gene with a "coding" zone and a "non-coding" zone.]

Base composition bias

Advantages:
• Input: the crude DNA sequence
• No information on reading frames is necessary.
• No information on organism-specific codon usage is needed.

Disadvantages:
• Short exons (<200 bp) are ignored.
• Frameshift errors reduce the prediction success.

Codon usage bias

The human codon usage table (http://www.kazusa.or.jp/codon/)

[Figure: two panels of the human codon usage table. "Frequency of usage": the frequency of usage of each codon (per thousand) in human coding regions. "Relative frequency": the relative frequency of each codon among synonymous codons.]

Leucine : Alanine : Tryptophan (species specific, example rat)
• Protein-encoding DNA = 6.9 : 6.5 : 1
• Random DNA = 6.0 : 4.0 : 1

Most amino acids are encoded by more than one codon.

Leucine    TTG   TTA   CTG   CTA   CTT   CTC
human     12.5   7.2  40.2   6.9  12.7  19.4
rat       12.4   5.0  40.8   7.0  11.2  20.4
xenopus   14.4   9.1  26.1   8.4  15.9  12.6
yeast     27.1  26.4  10.4  13.4  12.2   5.4

Frequency depends on the species and on the level of gene expression.
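The step from "frequency of usage" (per thousand) to "relative frequency" among synonymous codons can be sketched with the leucine values quoted on the slide. The dictionary layout and the function name `relative_frequency` are assumptions made for this example; only two of the four species are included to keep it short.

```python
# Leucine codon usage per thousand codons, as quoted on the slide.
LEUCINE_PER_THOUSAND = {
    "human": {"TTG": 12.5, "TTA": 7.2, "CTG": 40.2,
              "CTA": 6.9, "CTT": 12.7, "CTC": 19.4},
    "yeast": {"TTG": 27.1, "TTA": 26.4, "CTG": 10.4,
              "CTA": 13.4, "CTT": 12.2, "CTC": 5.4},
}


def relative_frequency(per_thousand):
    """Normalise per-thousand counts so synonymous codons sum to 1."""
    total = sum(per_thousand.values())
    return {codon: value / total for codon, value in per_thousand.items()}


human = relative_frequency(LEUCINE_PER_THOUSAND["human"])
yeast = relative_frequency(LEUCINE_PER_THOUSAND["yeast"])

# The preferred leucine codon differs between the two species.
human_preferred = max(human, key=human.get)
yeast_preferred = max(yeast, key=yeast.get)
```

A codon-usage content sensor compares such species-specific preferences with the codons observed in a candidate reading frame, which is why it needs a codon frequency table as input.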

Codon usage bias

Advantages:
• Input: the crude DNA sequence AND a codon frequency table
• No information on reading frame needed

Disadvantages:
• Weakly expressed genes have little bias
• Frameshift errors reduce the prediction success

Analysis of Sequence Signals

Content Sensors (large sequence motifs):
• base composition
• codon usage
• hexamer usage

Signal Sensors (short sequence motifs):
• Start/stop codons
• Splicing signals (3' and 5' signals, branchpoint, splice junctions)
• Polyadenylation signals
• Transcription regulation signals (TF binding sites, promoters)

String matching

Input: a text string t of length n and a pattern string p of length m.
Output: all instances of the pattern in the text.

• Use a consensus sequence (pattern) for a splice site, Kozak sequence or transcription factor binding site.
• Disadvantage: many false positives.

...ATGATAGATATACAGATTATATAGATCGAT...
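A consensus pattern search over the example text can be sketched with a regular expression. The translation table for IUPAC ambiguity codes and the function name `find_pattern` are assumptions introduced for this illustration; a lookahead is used so that overlapping hits are all reported.

```python
import re

# IUPAC nucleotide codes used in consensus sequences (subset).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "M": "[AC]", "K": "[GT]",
    "S": "[CG]", "W": "[AT]", "N": "[ACGT]",
}


def find_pattern(text, pattern):
    """Return the 0-based start of every (possibly overlapping) match."""
    body = "".join(IUPAC[symbol] for symbol in pattern.upper())
    regex = re.compile(f"(?=({body}))")   # zero-width lookahead keeps overlaps
    return [match.start() for match in regex.finditer(text.upper())]


TEXT = "ATGATAGATATACAGATTATATAGATCGAT"
tata_hits = find_pattern(TEXT, "TATA")
junction_hits = find_pattern("CAGGTAAGT", "MAGG")
```

A four-letter pattern such as TATA matches three times in a 30-base text, which illustrates why plain pattern matching produces many false positives.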

Patterns

TATA box:                TATA
Start codon:             GCCACCAUGG (Kozak sequence)
Stop codons:             UGA, UAA, UAG
Polyadenylation signals: YGUGUUYY (N)20-30 AAUAAA
Termination sequences:   not well defined in eukaryotes

Splice Sites

5   5' splice site:   CAG/GTAAGTAG
3   3' splice site:   (T)10NCAG/G(C) 9
B   branchpoint:      CT(G/A)A(C/T)
J   splice junction:  MAG/G

[Figure: sequence map marked with the symbols 5, 3, B and J.]

Profile or Position Weight Matrix

• Replace the pattern by a profile
• Employ training sets to build the profile and to optimize the algorithm.

Alignment:

           1234567...
           ACATTAA...
           TCAGAAT...
           ACAGAAC...
           AGATTAC...
           ACCGAAC...

Profile:

           1 2 3 4 5 6 7 ...
A          4 0 4 0 3 5 1 ...
C          0 4 1 0 0 0 3 ...
G          0 1 0 3 0 0 0 ...
T          1 0 0 2 2 0 1 ...
consensus  A C A G A A C ...
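The step from alignment to profile is a per-column count. The sketch below reproduces the count matrix and the consensus for the first seven columns of the example; the function names are assumptions made for illustration.

```python
# First seven columns of the alignment shown on the slide.
ALIGNED = ["ACATTAA", "TCAGAAT", "ACAGAAC", "AGATTAC", "ACCGAAC"]


def count_matrix(seqs, alphabet="ACGT"):
    """Count how often each base occurs in each alignment column."""
    length = len(seqs[0])
    return {
        base: [sum(seq[i] == base for seq in seqs) for i in range(length)]
        for base in alphabet
    }


def consensus(counts):
    """Pick the most frequent base in every column."""
    length = len(next(iter(counts.values())))
    return "".join(
        max(counts, key=lambda base: counts[base][i]) for i in range(length)
    )


COUNTS = count_matrix(ALIGNED)
CONSENSUS = consensus(COUNTS)
```

Unlike the consensus string alone, the count matrix keeps the information that, for example, position 4 is G in three sequences and T in two.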

Gribskov Profiles

[Figure: example Gribskov profile. Rows are alignment positions (POS 1, 2, 3, 4, 5, 6, ...); columns are the amino acids (A, C, D, E, F, G, H, ..., Y) plus a Gap column; each cell holds the score for finding that amino acid at that position.]

What is a Gribskov Profile?

A Gribskov profile is a weight matrix of the probabilities of appearance of amino acids in a certain position in a multiple alignment.


Differences between Gribskov Profiles and common sequence comparison methods

A group of related sequences can be used to build the profile

The profile includes position-specific penalties for insertion and deletion


What is needed to create a Gribskov Profile?

• A group of functionally related proteins (e.g. globins, immunoglobulins)
• A multiple alignment of these proteins, aligned by similarity or by three-dimensional structure:

    seq1.pep  ~CCGTL
    seq2.pep  GCGSL~
    seq3.pep  ~CGHSV
    seq4.pep  ~CGGTL
    seq5.pep  CCGSS~

• A mutational distance matrix (e.g. Blosum62, PAM250 Dayhoff)

Creating a Gribskov Profile

The profile is a sequence position-specific scoring matrix M(p,a):

• One row per aligned position p = 1, 2, 3, 4, ..., N, where N is the number of positions in the alignment.
• 21 columns: 20 of them specify the score of each amino acid a (A, C, D, E, ..., W, Y) at that position; 1 specifies the gap penalty, i.e. the penalty for a deletion or insertion at that position.

The profile is filled using the multiple alignment and the mutational distance matrix:

$W(p,b) = n(b,p) / N_R$

$M(p,a) = \sum_{b=1}^{20} W(p,b) \, Y(a,b)$

where W(p,b) is the weight of appearance of amino acid b at position p, n(b,p) is the number of times that amino acid b appears at position p, N_R is the number of rows in the alignment, and Y(a,b) is the value in the mutational distance matrix for the pair (a,b). For example, the tryptophan term of M(p,C) is $W(p,W) \, Y(C,W)$.

Mutational Distance Matrix: Blosum62 matrix

     A   B   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   X   Y   Z
A   4
B  -2   6
C   0  -3   9
D  -2   6  -3   6
E  -1   2  -4   2   5
F  -2  -3  -2  -3  -3   6
G   0  -1  -3  -1  -2  -3   6
H  -2  -1  -3  -1   0  -1  -2   8
I  -1  -3  -1  -3  -3   0  -4  -3   4
K  -1  -1  -3  -1   1  -3  -2  -1  -3   5
L  -1  -4  -1  -4  -3   0  -4  -3   2  -2   4
M  -1  -3  -1  -3  -2   0  -3  -2   1  -1   2   5
N  -2   1  -3   1   0  -3   0   1  -3   0  -3  -2   6
P  -1  -1  -3  -1  -1  -4  -2  -2  -3  -1  -3  -2  -2   7
Q  -1   0  -3   0   2  -3  -2   0  -3   1  -2   0   0  -1   5
R  -1  -2  -3  -2   0  -3  -2   0  -3   2  -2  -1   0  -2   1   5
S   1   0  -1   0   0  -2   0  -1  -2   0  -2  -1   1  -1   0  -1   4
T   0  -1  -1  -1  -1  -2  -2  -2  -1  -1  -1  -1   0  -1  -1  -1   1   5
V   0  -3  -1  -3  -2  -1  -3  -3   3  -2   1   1  -3  -2  -2  -3  -2   0   4
W  -3  -4  -2  -4  -3   1  -2  -2  -3  -3  -2  -1  -4  -4  -2  -3  -3  -2  -3  11
X  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1
Y  -2  -3  -2  -3  -2   3  -3   2  -1  -2  -1  -1  -2  -3  -1  -2  -2  -2  -1   2  -1   7
Z  -1   2  -4   2   5  -3  -2   0  -3   1  -3  -2   0  -1   2   0   0  -1  -2  -3  -1  -2   5

Highlighted entry: row W, column C, Y(C,W) = -2.


Worked example

Alignment (N_R = 5 rows):

    POS       123456
    seq1.pep  ~CCGTL
    seq2.pep  GCGSL~
    seq3.pep  ~CGHSV
    seq4.pep  ~CGGTL
    seq5.pep  CCGSS~

$M(p,a) = \sum_{b=1}^{20} W(p,b) \, Y(a,b)$

$M(1,A) = \sum_{b=1}^{20} W(1,b) \, Y(A,b)$

$M(1,A) = W(1,A) \, Y(A,A) + W(1,C) \, Y(A,C) + \dots + W(1,Y) \, Y(A,Y)$

$M(1,A) = (0.025/5 \cdot 4) + (1/5 \cdot 0) + \dots + (0.025/5 \cdot (-2))$

$M(1,C) = \sum_{b=1}^{20} W(1,b) \, Y(C,b)$

Amino acids not present at a position get a very small weight, $0.025/N_R$.

Consensus sequence: the symbol with the largest value in each position (CCGGTL).

[Figure: the resulting profile, with one row per position (POS 1 to 6).]
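The profile calculation for the example alignment can be sketched as follows. To keep the block short it sums over only the seven amino acids that occur in the alignment (C, G, H, L, S, T, V) rather than all 20, which is a simplifying assumption; the Blosum62 values are copied from the matrix in these slides. The weights follow the slide's rule W(p,b) = n(b,p)/N_R with 0.025/N_R for absent residues, and the function and variable names are hypothetical.

```python
ALPHABET = "CGHLSTV"   # assumption: restricted to residues in the example

# Lower triangle of Blosum62 for ALPHABET, in the order C, G, H, L, S, T, V.
_LOWER = {
    "C": [9],
    "G": [-3, 6],
    "H": [-3, -2, 8],
    "L": [-1, -4, -3, 4],
    "S": [-1, 0, -1, -2, 4],
    "T": [-1, -2, -2, -1, 1, 5],
    "V": [-1, -3, -3, 1, -2, 0, 4],
}
Y = {}
for a, row in _LOWER.items():
    for b, score in zip(ALPHABET, row):
        Y[a, b] = Y[b, a] = score          # the matrix is symmetric

ALIGNMENT = ["~CCGTL", "GCGSL~", "~CGHSV", "~CGGTL", "CCGSS~"]


def gribskov_profile(alignment, pseudo=0.025):
    """Return one {amino acid: M(p, a)} dictionary per alignment column."""
    n_rows = len(alignment)
    profile = []
    for column in zip(*alignment):
        # W(p,b) = n(b,p)/N_R, or pseudo/N_R when b is absent from the column
        weights = {b: (column.count(b) or pseudo) / n_rows for b in ALPHABET}
        profile.append(
            {a: sum(weights[b] * Y[a, b] for b in ALPHABET) for a in ALPHABET}
        )
    return profile


PROFILE = gribskov_profile(ALIGNMENT)
PROFILE_CONSENSUS = "".join(max(row, key=row.get) for row in PROFILE)
```

Even with the restricted alphabet, taking the highest-scoring symbol at each position yields the consensus given on the slide. The gap column of a full Gribskov profile is not modelled here.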

Scoring with a Gribskov Profile

[Figure: the Gribskov profile of the example alignment, with the entries used to score the sequence highlighted.]

$P(CCGGTL) = P_{p1}(C) \cdot P_{p2}(C) \cdot P_{p3}(G) \cdot P_{p4}(G) \cdot P_{p5}(T) \cdot P_{p6}(L)$

$\log P(CCGGTL) = \log P_{p1}(C) + \log P_{p2}(C) + \log P_{p3}(G) + \log P_{p4}(G) + \log P_{p5}(T) + \log P_{p6}(L)$

The probability of any sequence is calculated in the same way.
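The product-to-sum step can be sketched with per-position residue probabilities estimated from the example alignment. The additive pseudocount of 1 used here is an assumption chosen to keep every probability non-zero; it is not the 0.025/N_R weighting of the Gribskov profile, and the function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
ALIGNMENT = ["~CCGTL", "GCGSL~", "~CGHSV", "~CGGTL", "CCGSS~"]


def position_probabilities(alignment, pseudocount=1.0):
    """Estimate P_p(a) for each column, ignoring gap symbols (~)."""
    probabilities = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "~"]
        total = len(residues) + pseudocount * len(AMINO_ACIDS)
        probabilities.append(
            {a: (residues.count(a) + pseudocount) / total for a in AMINO_ACIDS}
        )
    return probabilities


def log_probability(seq, probabilities):
    """Sum of log P_p(residue) over positions; equals log of the product."""
    return sum(math.log(probabilities[i][r]) for i, r in enumerate(seq))


PROBS = position_probabilities(ALIGNMENT)
score_consensus = log_probability("CCGGTL", PROBS)
score_unrelated = log_probability("AAAAAA", PROBS)
```

Working with sums of logarithms avoids numerical underflow when many small probabilities are multiplied, and any candidate sequence of the same length is scored in the same way.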

Introduction

Gribskov Profile
• Definition
• Creating a Gribskov Profile
• Scoring a sequence with a Profile

Hidden Markov Models
• Definition
• State order of an HMM
• Biological application of HMMs
• Scoring a sequence with an HMM
• Basic Architecture

Building a Hidden Markov Model
• Estimation of the model
• Problems building an HMM
• HMM programs in HUSAR

Advantages of using Markov Models

• Markov Models are probabilistic models with a solid statistical foundation.
• In contrast to patterns and profiles, HMMs allow consistent treatment of insertions and deletions.
• In contrast to patterns and profiles, Markov Models take into account the information about neighboring residues.

[Figure: model fragment over the symbols C, G, C, -, T, A with probabilities P=0.6, P=0.1, P=0.2, P=0.09 and P=0.01.]

The Markov Model is based on active domains only!

Domain 1 (active binding site)    ATGTCGTCGTCG
Domain 2 (never found, inactive)  ATGTGGTCGTCG
Domain 3 (never found, inactive)  ATGTCATCGTCG
Domain 4 (active)                 ATGTGATCGTCG

Hidden Markov Model

First order Markov Model, positions 1 2 3 4 5 6 ...:

• If a G is found at position 3, P(T)4=1.0
• If a T is found at position 4, P(C)5=0.5, P(G)5=0.5
• If a C is found at position 5, P(A)6=0.0, P(G)6=1.0
• If a G is found at position 5, P(A)6=1.0, P(G)6=0.0
• If an A is found at position 6, P(T)7=1.0
• If a G is found at position 6, P(T)7=1.0

Order state of HMMs

• First order Markov Model: captures the first order correlation between neighboring nucleotides
• Fifth order Markov Model
• HMM models can use preceding, succeeding or surrounding residues
• There is no real limit to the number of preceding residues that can be used for an HMM (computing time!)

Markov Models take into account additional information about neighboring residues.
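The conditional probabilities listed for the binding-site example can be derived by training a position-specific first-order Markov model on the two active domains. This is a minimal sketch; the function names `train` and `probability` and the data layout are assumptions made for illustration.

```python
from collections import Counter

ACTIVE = ["ATGTCGTCGTCG", "ATGTGATCGTCG"]      # domains 1 and 4
INACTIVE = ["ATGTGGTCGTCG", "ATGTCATCGTCG"]    # domains 2 and 3


def train(seqs):
    """Estimate P(first base) and, per position, P(next base | base)."""
    first = Counter(seq[0] for seq in seqs)
    initial = {base: n / len(seqs) for base, n in first.items()}
    transitions = []
    for i in range(len(seqs[0]) - 1):
        pairs = Counter((seq[i], seq[i + 1]) for seq in seqs)
        previous = Counter(seq[i] for seq in seqs)
        transitions.append(
            {pair: n / previous[pair[0]] for pair, n in pairs.items()}
        )
    return initial, transitions


def probability(seq, model):
    """Probability of seq under the trained first-order model."""
    initial, transitions = model
    p = initial.get(seq[0], 0.0)
    for i, table in enumerate(transitions):
        p *= table.get((seq[i], seq[i + 1]), 0.0)   # unseen pair: probability 0
    return p


MODEL = train(ACTIVE)
active_probs = [probability(seq, MODEL) for seq in ACTIVE]
inactive_probs = [probability(seq, MODEL) for seq in INACTIVE]
```

A profile that treats positions 5 and 6 independently would give each of the four domains the same probability (0.5 × 0.5 = 0.25), whereas the first-order model assigns 0.5 to each active domain and 0 to the inactive ones.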

Biological applications of HMMs

• Radiation hybrid mapping (Slonim et al., 1997)
• Genetic linkage mapping (Kruglyak et al., 1996)
• Phylogenetic analysis (Felsenstein & Churchill, 1996)
• Protein homology recognition (Karplus et al., 1999)
• Protein secondary structure prediction (Goldman et al., 1996)
• Profile HMM libraries (PROSITE; Pfam database)
• Gene finding (Birney & Durbin, 1997; Henderson, 1997; Krogh, 1997; Lukashin & Borodovsky, 1998)

HMM: states and transitions

[Figure: an HMM with two states and an End state. State 1 emits a or b with probabilities P1(a) and P1(b); state 2 emits them with P2(a) and P2(b). Transitions: t1,1 and t1,2 leave state 1; t2,2 and t2,end leave state 2. Observed symbol sequence: a b a.]

For the state path 1, 1, 2, End:

$P(aba \mid \mathrm{HMM}) = P_1(a) \, t_{1,1} \, P_1(b) \, t_{1,2} \, P_2(a) \, t_{2,\mathrm{end}}$

[Figure: an HMM with the states Start, E, 5, I and End.
Transitions: Start to E 1.0; E to E 0.9; E to 5 0.1; 5 to I 1.0; I to I 0.9; I to End 0.1.
Emissions: E: PA=0.25, PC=0.25, PG=0.25, PT=0.25; 5: PA=0.05, PC=0, PG=0.95, PT=0; I: PA=0.4, PC=0.1, PG=0.1, PT=0.4.]

Markov Models assume that sequences are generated independently of the model

Applied to time series or to linear sequences
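Scoring a sequence along one state path multiplies one transition and one emission probability per step. The sketch below uses the transition and emission values of the E / 5 / I model; the assignment of each transition value to a pair of states is a reading of the figure, and the test sequence, state path and function name are hypothetical choices made for illustration.

```python
import math

# Assumed reading of the figure: which probability belongs to which arrow.
TRANSITIONS = {
    ("Start", "E"): 1.0, ("E", "E"): 0.9, ("E", "5"): 0.1,
    ("5", "I"): 1.0, ("I", "I"): 0.9, ("I", "End"): 0.1,
}
EMISSIONS = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "5": {"A": 0.05, "C": 0.0, "G": 0.95, "T": 0.0},
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def joint_log_probability(seq, path):
    """log P(seq, path | HMM); path holds one state name per symbol."""
    if len(seq) != len(path):
        raise ValueError("sequence and path must have the same length")
    states = ["Start", *path, "End"]
    log_p = 0.0
    for previous, current in zip(states, states[1:]):
        t = TRANSITIONS.get((previous, current), 0.0)
        if t == 0.0:
            return -math.inf               # forbidden transition
        log_p += math.log(t)
    for symbol, state in zip(seq, path):
        e = EMISSIONS[state][symbol]
        if e == 0.0:
            return -math.inf               # symbol cannot be emitted here
        log_p += math.log(e)
    return log_p


# Hypothetical example: 18 exon bases, one 5' splice-site base, 7 intron bases.
SEQ = "CTTCATGTGAAAGCAGACGTAAGTCA"
PATH = "E" * 18 + "5" + "I" * 7
LOG_P = joint_log_probability(SEQ, PATH)
```

Here each state name is a single character, so the path can be written as a string. Finding the best path over all possible paths is the job of a dynamic programming algorithm such as Viterbi, which this sketch does not implement.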

Basic Architecture of a profile HMM

[Figure: profile HMM running from Start to End through the match states m1, m2 and m3 (consensus residues C, C and Y), with delete states d1, d2, d3 and insert states i0, i1, i2, i3. Each match state carries a probability for every amino acid (A, C, D, E, F, G, H, I, ..., Y); values shown include 0.3, 0.015, 0.06, 0.01 and 0.5.]

Match states model the distribution of symbols in the corresponding column of an alignment. The probabilities are derived from information contained in the alignment.

Methods in Gene Prediction

Ab initio analysis of genomic sequences:
• Genscan (Burge and Karlin 1997)
• HMMer (Haussler et al. 1993, Krogh et al. 1994)
• FGenesH (Solovyev and Salamov 1994)

Comparison of protein and genomic sequences:
• Procrustes (Gelfand et al. 1996)
• Genewise (Birney and Durbin)

Cross-species genomic sequence comparisons:
• CEM (Bafna and Huson 2000)
• TWINSCAN (Korf et al. 2001)
• Doublescan (Meyer and Durbin 2002)
• SLAM (Alexandersson et al. 2003)

Gene prediction programs (many with homology searching capabilities)

GeneMachine http://genome.nhgri.nih.gov/genemachine

Genscan http://genome.dkfz-heidelberg.de

GenomeScan http://genes.mit.edu/genomescan

Fgenesh, Fgenes-M, TSSW, TSSG, Polyah, SPL and RNAS http://genomic.sanger.ac.uk/gf/gf.shtml

Fgenesh, Fgenes-M, SPL and RNASPL http://www.softberry.com/berry.phtml

HMMgene http://www.cbs.dtu.dk/services/HMMgene

Genie http://www.fruitfly.org/seq_tools/genie.html

GeneMark http://www.ebi.ac.uk/genemark

GeneID http://www1.imim.es/software/geneid/geneid.html#top

GeneParser http://beagle.colorado.edu/~eesnyder/GeneParser.html

MZEF and POMBE http://argon.cshl.org/genefinder/

AAT, MZEF with homology http://genome.cs.mtu.edu/aat.html

MZEF with SpliceProximalCheck http://industry.ebi.ac.uk/~thanaraj/MZEF-SPC.html

Genesplicer, Glimmer and GlimmerM http://www.tigr.org/~salzberg

WebGene http://www.itba.mi.cnr.it/webgene

GenLang http://www.cbil.upenn.edu/genlang/genlang_home.html

Xpound ftp://igs-server.cnrs-mrs.fr/pub/Banbury/xpound

Gene-prediction programs: alignment based

Procrustes http://www-hto.usc.edu/software/procrustes/index.hl

GeneWise2 http://www.sanger.ac.uk/Software/Wise2

SplicePredictor http://bioinformatics.iastate.edu/cgi-bin/sp.cgi

PredictGenes http://cbrg.inf.ethz.ch/Server/subsection3_1_8.html

Gene-prediction programs: comparative genomics

Doublescan http://www.sanger.ac.uk/Software/analysis/doublescan

SLAM http://bio.math.berkeley.edu/slam

Twinscan http://genes.cs.wustl.edu

Finding ORFs and splice sites

DioGenes http://www.cbc.umn.edu/diogenes/index.html

OrfFinder http://www.ncbi.nlm.nih.gov/gorf/gorf.html

YeastGene http://tubic.tju.edu.cn/cgi-bin/Yeastgene.cgi

CDS: search coding regions http://bioweb.pasteur.fr/seqanal/interfaces/cds-simple.html

Neural network splice site prediction http://www.fruitfly.org/seq_tools/splice.html

NetGene2 http://www.cbs.dtu.dk/services/NetGene2

RNA gene prediction

tRNAScan http://www.genetics.wustl.edu/eddy/tRNAscan-SE/

FGENES, FGENEH, FGENESH(+)

• Victor Solovyev and colleagues
• FGENE applications are based on HMMs
• They form a complete, partially automated, modular package
• Dynamic modelling with various features of coding sequences
• Precise determination of exon borders with homology search

[Figure: prediction on the model gene (TATA box at 1011; positions 1066, 1345, 2427 and 3058).]

GENIE (UCLA)

• Combination of statistical methods (HMM) and neural networks
• A candidate sequence is "threaded" through the HMM using a min-cost path search algorithm, and the system reports this "optimal" path as the predicted gene structure.
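The min-cost path search through an HMM can be illustrated with a small Viterbi-style sketch. This is not Genie's actual model: the two states, the probabilities, and the function name `viterbi_min_cost` are all assumptions chosen for illustration. Costs are negative log probabilities, so the cheapest path is also the most probable one.

```python
import math

def viterbi_min_cost(seq, states, start_p, trans_p, emit_p):
    """Return (cost, path) for the minimum-cost state path through the HMM."""
    def cost(p):
        # Negative log probability; impossible events get infinite cost.
        return -math.log(p) if p > 0 else math.inf

    # best[s] = (cost of the cheapest path ending in state s, that path)
    best = {s: (cost(start_p[s]) + cost(emit_p[s][seq[0]]), [s]) for s in states}
    for symbol in seq[1:]:
        nxt = {}
        for s in states:
            prev_cost, prev_path = min(
                ((best[r][0] + cost(trans_p[r][s]), best[r][1]) for r in states),
                key=lambda item: item[0],
            )
            nxt[s] = (prev_cost + cost(emit_p[s][symbol]), prev_path + [s])
        best = nxt
    return min(best.values(), key=lambda item: item[0])

# Hypothetical toy model: GC-rich "exon" state, AT-rich "intron" state.
states = ["exon", "intron"]
start_p = {"exon": 0.5, "intron": 0.5}
trans_p = {
    "exon": {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.1, "intron": 0.9},
}
emit_p = {
    "exon": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "intron": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

total_cost, path = viterbi_min_cost("GCGCGCATATAT", states, start_p, trans_p, emit_p)
print(path)
```

In this toy example the reported path labels the GC-rich half as exon and the AT-rich half as intron, which is the "predicted gene structure" in miniature.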

GrailEXP

• Widely used for GenBank annotations
• GrailEXP predicts exons, genes, promoters, poly(A) sites, CpG islands, EST similarities, and repetitive elements

Genscan (Chris Burge and Samuel Karlin)

• Genscan employs a dynamic programming strategy.
• General three-periodic (inhomogeneous) fifth-order Markov model.
• Models transcription, translation and splicing signals.
• Models length distributions and compositional features of introns, exons and intergenic regions.
• Exceptional: it was developed to recognize partial and multiple genes on both strands.
• Independent of databases.
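A three-periodic fifth-order Markov model keeps a separate transition table for each codon position, each conditioned on the five preceding bases. The sketch below shows only that idea; it is not Genscan's implementation, and the function names, the pseudocount smoothing and the toy training sequence are assumptions.

```python
from collections import defaultdict
import math

def train_periodic_markov(coding_seqs, order=5):
    """Count (context -> next base) separately for each of the 3 codon positions."""
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(3)]
    for seq in coding_seqs:
        for i in range(order, len(seq)):
            counts[i % 3][seq[i - order:i]][seq[i]] += 1
    return counts

def log_prob(seq, counts, order=5, frame=0, pseudo=1.0):
    """Log probability of seq, assuming seq[0] sits at codon position `frame`."""
    total = 0.0
    for i in range(order, len(seq)):
        table = counts[(i + frame) % 3].get(seq[i - order:i], {})
        num = table.get(seq[i], 0) + pseudo          # smoothed count of this base
        den = sum(table.values()) + 4 * pseudo       # smoothed count of the context
        total += math.log(num / den)
    return total

# Hypothetical training data: one artificial "coding" sequence.
model = train_periodic_markov(["ATGGCC" * 20])
query = "ATGGCC" * 3
in_frame = log_prob(query, model, frame=0)
shifted = log_prob(query, model, frame=1)
print(in_frame > shifted)
```

Scoring the same sequence in each of the three frames (and on the reverse strand) and comparing the results is one simple way such a model can separate the correct reading frame from the others.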

TWINSCAN (I. Korf et al., 2001)

• TWINSCAN models both gene structure and evolutionary conservation
• Scores of features (e.g. splice sites, coding regions) are modified using the patterns of divergence between the target genome and a closely related genome.

Prediction of a subsequence of the mouse genome

[Figure: tracks for the TWINSCAN prediction, the GENSCAN prediction, the actual gene structure, alignments to human genomic sequences, and reported repeat sequences.]

GLIMMER (Salzberg and colleagues, JHU)

• Finds genes in microbial DNA.
• Uses a combination of Markov models from first through eighth order, weighting each model according to its predictive power.
• Widely used for GenBank annotations.
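The idea of blending Markov models of several orders can be sketched as follows. The weighting used here, the number of times each context was seen in training, is only a stand-in assumption for "predictive power"; GLIMMER's actual weighting scheme is not described on the slide. The function names and the toy training sequence are also hypothetical.

```python
from collections import Counter, defaultdict

def train_counts(seqs, max_order=8):
    """counts[k][context] is a Counter of the bases seen after each k-mer context."""
    counts = {k: defaultdict(Counter) for k in range(1, max_order + 1)}
    for seq in seqs:
        for k in range(1, max_order + 1):
            for i in range(k, len(seq)):
                counts[k][seq[i - k:i]][seq[i]] += 1
    return counts

def interpolated_prob(context, base, counts, pseudo=1.0):
    """Blend the per-order estimates of P(base | context) into one probability."""
    weighted_sum = 0.0
    weight_total = 0.0
    for k, table in counts.items():
        if len(context) < k:
            continue
        seen = table.get(context[-k:], Counter())
        n = sum(seen.values())
        p_k = (seen[base] + pseudo) / (n + 4 * pseudo)
        weight = n  # ASSUMPTION: context frequency stands in for predictive power
        weighted_sum += weight * p_k
        weight_total += weight
    # Fall back to a uniform guess when no order has seen the context.
    return weighted_sum / weight_total if weight_total else 0.25

# Toy training data and a small maximum order to keep the example readable.
model = train_counts(["ACGT" * 10], max_order=3)
p_t = interpolated_prob("ACG", "T", model)
p_a = interpolated_prob("ACG", "A", model)
print(p_t > p_a)
```

Rarely seen long contexts receive little weight in this sketch, so the estimate falls back on shorter contexts when the training data is sparse.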

How (not) to use bioinformatics tools

• No single bioinformatics tool is 100% accurate (colleagues and developers may tell you the opposite).

• Common pitfall: for which organism was the application developed?

• Repetitive elements (such as the mouse L1 element) can be wrongly recognized as genes.

• Bioinformatics rule: try several approaches, and try to understand why they may give apparently contradictory results.

Evaluation of Gene Prediction Tools

The ideal test set is a segment of DNA for which all genes have been described experimentally.

In the example, the tool predicts 11 genes, 9 of which are among the 10 true genes.

Specificity = true predicted / all predicted. A measure of false positives: 9 / 11 = 81.8%

Sensitivity = true predicted / true genes. A measure of false negatives: 9 / 10 = 90%
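The two measures can be reproduced with a few lines of Python. The gene identifiers below are made up to match the counts in the example (10 true genes, 11 predictions, 9 of them correct), and the function name `evaluate` is an assumption.

```python
def evaluate(predicted, true):
    """Return (sensitivity, specificity) as defined on the slide."""
    predicted, true = set(predicted), set(true)
    true_predicted = len(predicted & true)
    sensitivity = true_predicted / len(true) if true else 0.0
    specificity = true_predicted / len(predicted) if predicted else 0.0
    return sensitivity, specificity

# Hypothetical identifiers: g1..g10 are the true genes; the tool finds
# g1..g9 and reports two false positives, x1 and x2.
true_genes = [f"g{i}" for i in range(1, 11)]
predicted_genes = [f"g{i}" for i in range(1, 10)] + ["x1", "x2"]

sensitivity, specificity = evaluate(predicted_genes, true_genes)
print(f"sensitivity = {sensitivity:.1%}, specificity = {specificity:.1%}")
# prints: sensitivity = 90.0%, specificity = 81.8%
```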

Accuracy versus G+C content

[Figure: bar chart of accuracy (0.00 to 1.00) for FGENES, GeneMark, Genie, Genscan, Morgan and MZEF, grouped by G+C content: 0 - 40%, 40 - 50%, 50 - 60%, 60 - 100%.]

http://www.cse.ucsc.edu/~rogic/evaluation/tablesgen.html

Exon accuracy

[Figure: bar chart of exon accuracy (0.00 to 1.00) for FGENES, GeneMark, Genie, Genscan, HMMgene, Morgan and MZEF. Series: sensitivity (false negatives), specificity (false positives), partially correct predictions.]

Accuracy versus exon length

[Figure: bar chart of accuracy (0.00 to 1.00) for FGENES, GeneMark, Genie, Genscan, HMMgene, Morgan and MZEF, grouped by exon length: 0 - 24, 25 - 49, 50 - 74, 75 - 99, 100 - 199, 200 - 299, 300+.]

Accuracy versus exon type

[Figure: bar chart of accuracy (0.00 to 1.00) for FGENES, GeneMark, Genie, Genscan, HMMgene and Morgan, grouped by exon type: initial, internal, terminal, single.]

Open Problems and Future Directions

• Close to 90% of nucleotides can be identified correctly, but the exact boundaries of exons and their assembly into complete coding sequences are much more difficult to predict. Fewer than half of all genes are predicted exactly.

• The fact that multiple protein products can correspond to a single gene, through alternative splicing, alternative transcription or alternative translation, has not been dealt with effectively.

• Promoter recognition
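The gap between nucleotide-level and exon-level accuracy can be seen in a small worked example. The exon coordinates are invented for illustration (inclusive start and end positions), and the function names are assumptions: a prediction can cover most coding nucleotides while getting only one exon boundary pair exactly right.

```python
def bases(exons):
    """Set of all positions covered by a list of inclusive (start, end) exons."""
    covered = set()
    for start, end in exons:
        covered.update(range(start, end + 1))
    return covered

def nucleotide_sensitivity(predicted, true):
    """Fraction of true coding nucleotides that the prediction covers."""
    true_bases = bases(true)
    return len(bases(predicted) & true_bases) / len(true_bases)

def exact_exon_sensitivity(predicted, true):
    """Fraction of true exons predicted with both boundaries exactly right."""
    return len(set(predicted) & set(true)) / len(true)

# Hypothetical gene: the second and third predicted exons miss one boundary each.
true_exons = [(100, 199), (300, 399), (500, 599)]
predicted_exons = [(100, 199), (305, 399), (500, 590)]

nt_level = nucleotide_sensitivity(predicted_exons, true_exons)
exon_level = exact_exon_sensitivity(predicted_exons, true_exons)
print(f"nucleotide level: {nt_level:.1%}, exact exons: {exon_level:.1%}")
```

Here about 95% of the coding nucleotides are recovered, yet only one exon in three is exactly correct, so the gene as a whole would count as mispredicted.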
