estructura de gene procariota promoter cds terminator transcription genomic dna mrna protein utr...
Post on 19-Dec-2015
Prokaryotic gene structure
[Figure: a prokaryotic gene on genomic DNA, with promoter, CDS and terminator. Transcription from +1 yields an mRNA (5' to 3') with a UTR on each side of the CDS; translation of the mRNA yields the protein. Operons are indicated on the same diagram.]
Prokaryotic Gene Organisation
[Figure: DNA with a repressor or activator and RNA polymerase bound at the promoter, and a ribosome on the mRNA, whose regions are labelled leader, spacer and trailer.]
Transcription: 2 consensus sequences and the startpoint
-10: TATAAT (T80 A95 t45 A60 a50 T96)
-35: TTGACA (T82 T84 G78 A65 C54 a45)
Translation: RBS (ribosome binding site), Shine-Dalgarno sequence AGGAGG
Promoters and regulators in prokaryotes
• The promoter determines:
1. Which strand will serve as a template.
2. The transcription starting point.
3. The strength of polymerase binding.
• The RNA polymerase subunit that recognizes the promoter is called the sigma factor
– Different variations (7 for E. coli)
– Consensus binding sequences (Table 6.2 in textbook)
• Operons allow co-transcription
• Regulators affect the binding of RNA polymerase to DNA (positive and negative)
Example of a prokaryotic promoter
• Pribnow box located at –10 (6-7bp)
• Promoter sequence located at -35 (6bp)
Consensus sequences
• Promoter sequences can vary tremendously.
• RNA polymerase recognizes hundreds of different promoters
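Because real promoters deviate from the consensus, a scan for the -35 and -10 boxes has to tolerate mismatches. The following is a minimal sketch of such a scan, not a method taken from the slides: the function names, the allowed spacer range of 16 to 19 bp and the mismatch budget of 2 are illustrative assumptions.

```python
def mismatches(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_promoters(seq, box35="TTGACA", box10="TATAAT",
                   spacers=range(16, 20), max_mismatch=2):
    """Return (pos35, pos10, spacer, total_mismatches) for candidate promoters.

    Positions are 0-based starts of the -35 and -10 boxes. The spacer range
    and the mismatch budget are assumptions chosen for illustration.
    """
    seq = seq.upper()
    n35, n10 = len(box35), len(box10)
    hits = []
    for i in range(len(seq) - n35 + 1):
        m35 = mismatches(seq[i:i + n35], box35)
        if m35 > max_mismatch:
            continue
        for gap in spacers:
            j = i + n35 + gap
            if j + n10 > len(seq):
                break
            total = m35 + mismatches(seq[j:j + n10], box10)
            if total <= max_mismatch:
                hits.append((i, j, gap, total))
    return hits
```

Raising `max_mismatch` finds weaker promoters at the cost of more false positives, which is the usual trade-off with consensus patterns.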
Terminators
• The terminator region pauses the polymerase and causes dissociation.
• The final mRNA may represent less than 5% of the transcribed DNA sequence
Production of a mature RNA in eukaryotes
[Figure: simplified model of a human gene, 5' to 3': regulatory region, promoter, then the transcription unit with exons 1 to n separated by introns, then a regulatory region. The first and last parts of the transcription unit are sequences that are not translated.]
After post-transcriptional processing of the primary transcript, the mRNA sequence corresponds to the exon sequences, including the untranslated regions (UTRs); the introns are removed.
Eukaryotic genes normally contain introns
Types of genes in eukaryotes
Protein-encoding genes
• Transcription
• RNA polymerase II dependent promoters
• Type II splicing
• Polyadenylation (exception: histone mRNAs)
• Translation
RNA-coding genes
• Transcription
• RNA polymerase I and III dependent promoters
• Type I and III splicing
• No polyadenylation
• No translation
Structure of a eukaryotic gene
• TATA box located at -25
– TATA(A/T)A(A/T)
– Recognized by the TATA-binding protein
• Initiator sequence at +1
– YYCARR; Y is C/T, R is G/A
– +1 is usually the A
• Transcription factors bind to promoters
– Position-specific scoring matrix (PSSM)
• Possible distant regions acting as enhancers or silencers (even more than 50 kb away)
– More complex mechanism than in prokaryotes
Transcription can be modulated by factors that act in trans: activators, which bind enhancers, and repressors, which bind silencers
Alternative splicing can produce different proteins with different functions
[Figure: two splice variants; one contains domains that adhere to cell surfaces, the other lacks domains that adhere to cell surfaces.]
Eukaryotic Gene Organisation
Transcription:
• core promoter: loosely conserved initiator region (Inr) around the TSS; TATA box at ~ -25
• proximal promoter: CAAT box (CCAAT) at ~ -75; GC box at ~ -170
• enhancer/silencer: upstream or downstream of the promoter
[Figure: promoter layout, 5' to 3': GC box and CAAT box (proximal promoter), then TATA box and Inr at the TSS (core promoter).]
Translation:
• 5' Kozak sequence: GCCACCATG
• 3' polyadenylation site: AATAAA
Eukaryote gene structure vs. prokaryote gene structure
• No operons
• Capping at the 5' end and polyadenylation at the 3' end
– Transport of mRNA out of the nucleus
– Affects stability and efficiency of translation
• Introns
• Alternative splicing
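Summary
• Prokaryotic genes
[Figure: a promoter followed by several genes in a row, each with a start and a stop, then a terminator.]
• Eukaryotic genes
[Figure: a promoter followed by exons separated by introns, between start and stop; each intron is bounded by a donor and an acceptor site.]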
Gene prediction: Prokaryotes vs. Eukaryotes
Prokaryotes
• Conserved promoter region (-10, -35; fixed spacing)
• Contiguous open reading frames (ORF)
• Polycistronic mRNAs
• Short intergenic sequences
Good method: detecting large ORFs
• Complications:
– Sequencing errors
– Very small genes will be missed
– Overlapping genes on both strands
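The "detect large ORFs" approach can be sketched in a few lines. This is an illustrative sketch, not a published gene finder: it assumes ATG as the only start codon, the standard three stop codons, and a hypothetical default threshold of 100 codons; all six reading frames are scanned so that genes on both strands are reported.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=100):
    """Return (strand, start, end, codons) for ORFs in all six frames.

    start/end are 0-based, half-open coordinates on the forward strand;
    codons counts the stop codon. An ORF runs from the first ATG in a
    frame to the next in-frame stop.
    """
    seq = seq.upper()
    n = len(seq)
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, n - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    codons = (i + 3 - start) // 3
                    if codons >= min_codons:
                        if strand == "+":
                            orfs.append(("+", start, i + 3, codons))
                        else:
                            # map reverse-strand coordinates back to the forward strand
                            orfs.append(("-", n - (i + 3), n - start, codons))
                    start = None
    return orfs
```

The length threshold is what causes very small genes to be missed, and a single frameshift from a sequencing error splits or truncates the ORF, which matches the complications listed above.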
Promoter and Gene prediction: Prokaryotes vs. Eukaryotes
• Promoter elements
– core promoter: initiator region (Inr), TATA box, downstream promoter element (DPE)
– proximal promoter: transcription factor ("TF") binding sites: CAAT box, GC box, SP-1 sites, GAGA boxes
– enhancer/silencer sites (less useful)
• Coding sequence
– signal sensors: start and stop signals (Kozak sequence, stop codons), polyadenylation signals, splicing signals (3' and 5' splice sites, splice junction, branchpoint)
– content sensors: base composition, codon usage, hexamer usage
• The speed with which new data are collected is increasing and exceeds the rate at which they can be analysed.
• Whole-genome sequences for more than 800 organisms (bacteria, archaea, and eukaryota as well as many viruses and organelles) are either complete or being determined.
• Across all sequenced species, nearly half of the potential genes cannot be assigned a specific role.
The challenge
Gene prediction programs should be able to automatically identify and annotate all genes
Three Basic Strategies for Promoter and Gene Prediction
• Homology searching
• Analysis of sequence signals
• Statistical analysis
Evolutionary relationships
[Figure: tree with one ancestor giving rise to species 1, species 2 and species 3.]
Paralogues: homologous proteins that perform different but related functions within one organism.
Orthologues: homologous proteins that perform the same function in different species.
Why homology?
• Investigate sequence databases such as EMBL or Swissprot with programs such as BLAST or TFASTA.
• Orthologs / homologs / paralogs may have been described. Sequence identity may be low; several approaches should be tried.
• As more sequence data is collected, this initial step becomes more important.
Low coverage, high accuracy
Homology Searching
Three Basic Strategies for Gene Prediction
• Homology searching
• Analysis of sequence signals
• Statistical analysis
Which signals can be used in bioinformatics for gene prediction?
Genomic sequences tend towards randomness; genes are non-random.
What distinguishes genes from other genomic sequences?
Translated DNA sequences are restricted in the choice of nucleotides in the first, second and (to a lesser extent) third position of the codons.
Occurrence of a certain base in first, second and third position of the potential codons will not be random.
Codon position: 123123123123123123123123123123123123123123123123123123
Sequence:       ATGATAGCTATACGGATCCGTAGCTAGATCAGTAGCGTGACTGCTGTCGTCATT
A(1,4,7...) = 10 of 18 (random sequence expectation = 25%)
A(2,5,8...) = 1 of 18 (random sequence expectation = 25%)
Confidence levels can be calculated because large sets of coding and non-coding sequences have been analyzed.
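The counts in the example above can be reproduced with a short script. This is a minimal sketch; the function name and the `frame` parameter are illustrative, and the sequence is the 54-nucleotide example from the slide.

```python
from collections import Counter


def codon_position_counts(seq, frame=0):
    """Count each base at codon positions 1, 2 and 3 of the given frame.

    Returns (counts, n_codons), where counts[0] covers positions 1, 4, 7, ...
    and trailing bases that do not fill a codon are ignored.
    """
    seq = seq.upper()
    counts = [Counter(), Counter(), Counter()]
    n_codons = (len(seq) - frame) // 3
    for i in range(n_codons * 3):
        counts[i % 3][seq[frame + i]] += 1
    return counts, n_codons


SLIDE_SEQ = "ATGATAGCTATACGGATCCGTAGCTAGATCAGTAGCGTGACTGCTGTCGTCATT"
counts, n_codons = codon_position_counts(SLIDE_SEQ)
for pos in range(3):
    print(f"A at codon position {pos + 1}: {counts[pos]['A']} of {n_codons}")
```

Comparing such counts with the 25% expected for a random sequence is the basis of the base composition bias discussed here; a real test would use frequencies estimated from large sets of known coding and non-coding sequences.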
Base composition
Frequency of the four different nucleotides at the different codon positions in human coding regions.
Base composition bias
Growth Factor, Mouse
• Weakly expressed, tissue specific
• GC-rich (57%; cds 66%)
• 'TATA' promoter (1011-1017)
• 2 exons
Not an easily predictable gene!
[Figure: map of the gene with the TATA box at 1011 and positions 1066, 1345, 2427 and 3058 marked.]
Bottner M, Laaff M, Schechinger B, Rappold G, Unsicker K, Suter-Crazzolara C. Gene. 1999 (237):105-11.
Our model gene
'Period three constraint' [J. Fickett, Nucl. Acids Res. 10(17); 5303-5318 (1982)]. The top and bottom regions predict coding and non-coding regions to a 95% confidence level. Start and stop codons (dashes and diamonds) are indicated.
[Figure: Testcode plot for the model gene, with the coding region at the top and the non-coding region at the bottom.]
Testcode
Advantages:
• Input: the crude DNA sequence
• No information on reading frames is necessary.
• No information on organism-specific codon usage is needed.
Disadvantages:
• Short exons (<200bp) are ignored.
• Frameshift errors reduce the prediction success.
Base composition bias
Codon usage bias
The human codon usage table (http://www.kazusa.or.jp/codon/)
• The frequency of usage of each codon (per thousand) in human coding regions.
• The relative frequency of each codon among synonymous codons.
Leucine : Alanine : Tryptophan
Protein-encoding DNA = 6.9 : 6.5 : 1
Random DNA = 6.0 : 4.0 : 1
(Species specific, example rat)
Most amino acids are encoded by more than one codon.
Leucine   TTG   TTA   CTG   CTA   CTT   CTC
human     12.5   7.2  40.2   6.9  12.7  19.4
rat       12.4   5.0  40.8   7.0  11.2  20.4
xenopus   14.4   9.1  26.1   8.4  15.9  12.6
yeast     27.1  26.4  10.4  13.4  12.2   5.4
Frequency dependent on species, level of gene expression.
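The per-thousand frequencies can be converted into relative frequencies among the synonymous leucine codons, which makes the species-specific preference explicit. A minimal sketch using only the leucine values from the table; the dictionary and function names are illustrative.

```python
# Leucine codon usage per thousand codons, copied from the table above.
LEUCINE_PER_THOUSAND = {
    "human":   {"TTG": 12.5, "TTA": 7.2,  "CTG": 40.2, "CTA": 6.9,  "CTT": 12.7, "CTC": 19.4},
    "rat":     {"TTG": 12.4, "TTA": 5.0,  "CTG": 40.8, "CTA": 7.0,  "CTT": 11.2, "CTC": 20.4},
    "xenopus": {"TTG": 14.4, "TTA": 9.1,  "CTG": 26.1, "CTA": 8.4,  "CTT": 15.9, "CTC": 12.6},
    "yeast":   {"TTG": 27.1, "TTA": 26.4, "CTG": 10.4, "CTA": 13.4, "CTT": 12.2, "CTC": 5.4},
}


def relative_frequencies(per_thousand):
    """Fraction of each codon among the synonymous codons of one amino acid."""
    total = sum(per_thousand.values())
    return {codon: value / total for codon, value in per_thousand.items()}


for species, usage in LEUCINE_PER_THOUSAND.items():
    rel = relative_frequencies(usage)
    preferred = max(rel, key=rel.get)
    print(f"{species}: preferred leucine codon {preferred} ({rel[preferred]:.2f})")
```

In these data CTG is the preferred leucine codon in human, rat and xenopus, while yeast prefers TTG, which is why a codon frequency table has to match the organism being analysed.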
Codon usage bias
Advantages:
• Input: the crude DNA sequence AND a codon frequency table
• No information on reading frame needed
Disadvantages:
• Weakly expressed genes have little bias
• Frameshift errors reduce the prediction success
Codon usage bias
Analysis of Sequence Signals
Content sensors (large sequence motifs):
• base composition
• codon usage
• hexamer usage
Signal sensors (short sequence motifs):
• Start/stop codons
• Splicing signals (3' and 5' signals, branchpoint, splice junctions)
• Polyadenylation signals
• Transcription regulation signals (TF binding sites, promoters)
String matching
Input: a text string t of length n and a pattern string p of length m.
Output: all instances of the pattern in the text.
• Use a consensus sequence (pattern) for a splice site, Kozak sequence or transcription factor binding site.
• Disadvantage: many false positives.
...ATGATAGATATACAGATTATATAGATCGAT...
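A naive matcher is enough to illustrate the idea. The sketch below checks every start position (O(n·m) time) and accepts IUPAC ambiguity codes such as Y, R, M and N, so consensus patterns can be searched directly. The function name is illustrative, and the text is the example fragment from the slide.

```python
# IUPAC nucleotide codes; U is treated as T so RNA patterns can be searched in DNA.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT",
    "S": "CG", "W": "AT", "N": "ACGT",
}


def find_pattern(text, pattern):
    """Return the 0-based start of every (possibly overlapping) match."""
    text = text.upper().replace("U", "T")
    allowed = [IUPAC[c] for c in pattern.upper()]
    m = len(allowed)
    return [
        i
        for i in range(len(text) - m + 1)
        if all(text[i + k] in allowed[k] for k in range(m))
    ]


TEXT = "ATGATAGATATACAGATTATATAGATCGAT"
print(find_pattern(TEXT, "TATA"))  # exact pattern
print(find_pattern(TEXT, "YAG"))   # degenerate pattern: C or T, then AG
```

Even in this 30-nucleotide fragment the pattern TATA matches three times, which illustrates why short consensus patterns produce many false positives.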
Patterns
TATA box: TATA
Stop codons: UGA, UAA, UAG
Start codon: GCCACCAUGG (Kozak sequence)
Polyadenylation signals: YGUGUUYY (N)20-30 AAUAAA
Termination sequences: not well defined in eukaryotes
5: 5' splice site: CAG/GTAAGTAG
3: 3' splice site: (T)10NCAG/G(C) 9
B: branchpoint: CT(G/A)A(C/T)
J: splice junction: MAG/G
[Figure: transcript diagram with the 5' splice sites (5), 3' splice sites (3), branchpoints (B) and splice junctions (J) marked.]
Splice Sites
• Replace the pattern by a profile
• Employ training sets to build the profile and to optimize the algorithm.
Profile or Position Weight Matrix
Alignment
            1234567...
            ACATTAA...
            TCAGAAT...
            ACAGAAC...
            AGATTAC...
            ACCGAAC...
Profile
            1 2 3 4 5 6 7 ...
A           4 0 4 0 3 5 1 ...
C           0 4 1 0 0 0 3 ...
G           0 1 0 3 0 0 0 ...
T           1 0 0 2 2 0 1 ...
consensus   A C A G A A C ...
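The profile above is simply a per-column count of each base in the alignment, and the consensus is the most frequent base in each column. A minimal sketch that reproduces the slide's numbers; the function names are illustrative, and ties in a column (none occur here) would be resolved in alphabet order.

```python
def count_matrix(seqs, alphabet="ACGT"):
    """Per-column base counts for a gap-free alignment of equal-length sequences."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    return {
        base: [sum(s[i] == base for s in seqs) for i in range(length)]
        for base in alphabet
    }


def consensus(matrix):
    """Most frequent base in each column."""
    length = len(next(iter(matrix.values())))
    return "".join(
        max(matrix, key=lambda base: matrix[base][i]) for i in range(length)
    )


ALIGNMENT = ["ACATTAA", "TCAGAAT", "ACAGAAC", "AGATTAC", "ACCGAAC"]
profile = count_matrix(ALIGNMENT)
for base, row in profile.items():
    print(base, *row)
print("consensus", consensus(profile))
```

Dividing each count by the number of sequences (and usually adding pseudocounts) turns the count matrix into position-specific probabilities that can be used to score candidate sites.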
Three Basic Strategies for Gene Prediction
• Homology searching
• Analysis of sequence signals
• Statistical analysis
[Table: example Gribskov profile. Rows are the alignment positions (POS 1 to 6, ...); columns are the amino acids (A, C, D, E, F, G, H, ..., L, S, T, Y) followed by a Gap column; each cell holds the score for finding that amino acid at that position.]
Gribskov Profiles
What is a Gribskov Profile?
A Gribskov profile is a weight matrix of the probabilities of appearance of amino acids in a certain position in a multiple alignment.
Gribskov Profiles
Differences between Gribskov Profiles and common sequence comparison methods
A group of related sequences can be used to build the profile
The profile includes position-specific penalties for insertion and deletion
Gribskov Profiles
What is needed to create a Gribskov Profile?
• A group of functionally related proteins (e.g. globins, immunoglobulins)
• aligned by similarity or by three-dimensional structure:
seq1.pep ~CCGTL
seq2.pep GCGSL~
seq3.pep ~CGHSV
seq4.pep ~CGGTL
seq5.pep CCGSS~
• A mutational distance matrix (Blosum62, PAM250, Dayhoff)
Gribskov Profiles
Sequence position-specific scoring matrix M(p,a)
• Rows: the aligned positions 1, 2, 3, 4, ..., N, where N is the number of positions in the alignment.
• 21 columns: 20 of them (A, C, D, E, ..., W, Y) specify the score of each amino acid at a certain position; 1 (Gap) specifies the penalty for deletion or insertion in that position.
The profile is filled using the multiple alignment and the mutational distance matrix:
• W(p,b) = n(b,p)/NR is the weight of appearance of amino acid b at position p; n(b,p) is the number of times that amino acid b appears in position p, and NR is the number of rows in the alignment.
• Y(a,b) is the value in the mutational distance matrix.
$M(p,a) = \sum_{b=1}^{20} W(p,b)\,Y(a,b)$
For example, one term of M(p,C) is W(p,W) * Y(C,W).
Creating a Gribskov Profile
   A  B  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  X  Y  Z
A  4
B -2  6
C  0 -3  9
D -2  6 -3  6
E -1  2 -4  2  5
F -2 -3 -2 -3 -3  6
G  0 -1 -3 -1 -2 -3  6
H -2 -1 -3 -1  0 -1 -2  8
I -1 -3 -1 -3 -3  0 -4 -3  4
K -1 -1 -3 -1  1 -3 -2 -1 -3  5
L -1 -4 -1 -4 -3  0 -4 -3  2 -2  4
M -1 -3 -1 -3 -2  0 -3 -2  1 -1  2  5
N -2  1 -3  1  0 -3  0  1 -3  0 -3 -2  6
P -1 -1 -3 -1 -1 -4 -2 -2 -3 -1 -3 -2 -2  7
Q -1  0 -3  0  2 -3 -2  0 -3  1 -2  0  0 -1  5
R -1 -2 -3 -2  0 -3 -2  0 -3  2 -2 -1  0 -2  1  5
S  1  0 -1  0  0 -2  0 -1 -2  0 -2 -1  1 -1  0 -1  4
T  0 -1 -1 -1 -1 -2 -2 -2 -1 -1 -1 -1  0 -1 -1 -1  1  5
V  0 -3 -1 -3 -2 -1 -3 -3  3 -2  1  1 -3 -2 -2 -3 -2  0  4
W -3 -4 -2 -4 -3  1 -2 -2 -3 -3 -2 -1 -4 -4 -2 -3 -3 -2 -3 11
X -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
Y -2 -3 -2 -3 -2  3 -3  2 -1 -2 -1 -1 -2 -3 -1 -2 -2 -2 -1  2 -1  7
Z -1  2 -4  2  5 -3 -2  0 -3  1 -3 -2  0 -1  2  0  0 -1 -2 -3 -1 -2  5
Example: the entry in row W, column C is Y(C,W) = -2.
Mutational Distance Matrix: Blosum62 matrix
Creating a Gribskov Profile
The profile is filled using the multiple alignment and the mutational distance matrix:
• W(p,b) = n(b,p)/NR is the weight of appearance of amino acid b at position p; n(b,p) is the number of times that amino acid b appears in position p, and NR is the number of rows in the alignment.
• Y(a,b) is the value in the mutational distance matrix.
$M(p,a) = \sum_{b=1}^{20} W(p,b)\,Y(a,b)$
For example, the term of M(p,C) contributed by tryptophan is W(p,W) * Y(C,W), with Y(C,W) = -2.
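Alignment
seq1.pep ~CCGTL
seq2.pep GCGSL~
seq3.pep ~CGHSV
seq4.pep ~CGGTL
seq5.pep CCGSS~
[Table: the profile for this alignment; rows are positions 1 to 6, columns are the amino acids (A, C, D, E, F, G, H, ..., L, S, T, Y) plus Gap, and each cell is the score for finding that amino acid at that position.]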
$M(p,a) = \sum_{b=1}^{20} W(p,b)\,Y(a,b)$
$M(1,A) = \sum_{b=1}^{20} W(1,b)\,Y(A,b)$
M(1,A) = ( W(1,A) * Y(A,A) ) + ( W(1,C) * Y(A,C) ) + ... + ( W(1,Y) * Y(A,Y) )
M(1,A) = ( 0.025/5 * 4 ) + ( 1/5 * 0 ) + ... + ( 0.025/5 * -2 )
Consensus sequence: the symbol with the largest value in each position (CCGGTL)
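The formula M(p,a) = sum over b of W(p,b) * Y(a,b) can be tried on one column of the example alignment. The sketch below is a simplified illustration and makes two assumptions that are not in the slides: the alphabet is restricted to four amino acids (C, G, S, T), and the small 0.025/NR weight for absent amino acids is left out. The substitution values are copied from the Blosum62 matrix shown in the slides; the function names are hypothetical.

```python
# Substitution values for C, G, S, T, copied from the matrix in the slides.
Y = {
    ("C", "C"): 9,
    ("G", "C"): -3, ("G", "G"): 6,
    ("S", "C"): -1, ("S", "G"): 0, ("S", "S"): 4,
    ("T", "C"): -1, ("T", "G"): -2, ("T", "S"): 1, ("T", "T"): 5,
}


def y(a, b):
    """Symmetric lookup in the lower-triangular table."""
    return Y[(a, b)] if (a, b) in Y else Y[(b, a)]


def profile_scores(column, alphabet="CGST"):
    """M(p,a) for one alignment column, with W(p,b) = n(b,p) / NR.

    Simplification: amino acids absent from the column get weight 0
    instead of the small 0.025/NR weight used in the slides.
    """
    nr = len(column)  # NR: number of rows in the alignment
    w = {b: column.count(b) / nr for b in alphabet}
    return {a: sum(w[b] * y(a, b) for b in alphabet) for a in alphabet}


# Third column of the example alignment (seq1 to seq5): C, G, G, G, G.
scores = profile_scores("CGGGG")
for aa, score in scores.items():
    print(aa, round(score, 2))
```

Glycine gets the highest score for this column, in line with the G at position 3 of the consensus sequence CCGGTL.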
Creating a Gribskov Profile
[Table: the profile being filled in; rows are positions, columns are the amino acids plus Gap.]
$M(1,C) = \sum_{b=1}^{20} W(1,b)\,Y(C,b)$
Amino acids not present in a position get a very small weight, 0.025/NR.
Scoring with a Gribskov Profile
[Table: the profile for the alignment of seq1.pep to seq5.pep (consensus CCGGTL), used to score a sequence position by position.]
$P(\mathrm{CCGGTL}) = P_{p1}(C) \cdot P_{p2}(C) \cdot P_{p3}(G) \cdot P_{p4}(G) \cdot P_{p5}(T) \cdot P_{p6}(L)$
$\log P(\mathrm{CCGGTL}) = \log P_{p1}(C) + \log P_{p2}(C) + \log P_{p3}(G) + \log P_{p4}(G) + \log P_{p5}(T) + \log P_{p6}(L)$
The probability of any sequence is calculated in the same way
Introduction
• Gribskov Profile
– Definition
– Creating a Gribskov Profile
– Scoring a sequence with a Profile
• Hidden Markov Models
– Definition
– State order of an HMM
– Biological application of HMMs
– Scoring a sequence with an HMM
• Building a Hidden Markov Model
– Basic Architecture
– Estimation of the model
– Problems building an HMM
• HMM programs in HUSAR
Advantages of using Markov Models
[Figure: one position of a model between neighboring residues, with the alternative symbols A, C, G, T and a gap (-) and the probabilities P=0.6, P=0.1, P=0.2, P=0.09 and P=0.01.]
Markov Models are probabilistic models with a solid statistical foundation.
In contrast to patterns and profiles, HMMs allow consistent treatment of insertions and deletions.
In contrast to patterns and profiles, Markov Models take into account the information about neighboring residues.
The Markov Model is based on active domains only!
Domain 1 (active binding site)   ATGTCGTCGTCG
Domain 2 (never found, inactive) ATGTGGTCGTCG
Domain 3 (never found, inactive) ATGTCATCGTCG
Domain 4 (active)                ATGTGATCGTCG
Hidden Markov Model
If a G is found at position 3, P(T)4=1.0
If a T is found at position 4, P(C)5=0.5, P(G)5=0.5
If a C is found at position 5, P(A)6=0.0, P(G)6=1.0
If a G is found at position 5, P(A)6=1.0, P(G)6=0.0
If an A is found at position 6, P(T)7=1.0
If a G is found at position 6, P(T)7=1.0
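The difference between a profile and a model that looks at the neighboring residue can be checked numerically on the four domains. The sketch below trains both models on the two active domains only; it is an illustration with hypothetical function names, using a position-specific first-order Markov chain rather than a full HMM with insert and delete states.

```python
from collections import defaultdict

ACTIVE = ["ATGTCGTCGTCG", "ATGTGATCGTCG"]    # domains 1 and 4
INACTIVE = ["ATGTGGTCGTCG", "ATGTCATCGTCG"]  # domains 2 and 3


def train_markov(seqs):
    """Position-specific first-order model estimated from aligned sequences."""
    first = defaultdict(float)
    for s in seqs:
        first[s[0]] += 1 / len(seqs)
    transitions = []  # transitions[i][(x, y)] = P(y at i+1 | x at i)
    for i in range(len(seqs[0]) - 1):
        pairs, prev = defaultdict(int), defaultdict(int)
        for s in seqs:
            pairs[(s[i], s[i + 1])] += 1
            prev[s[i]] += 1
        transitions.append({k: c / prev[k[0]] for k, c in pairs.items()})
    return dict(first), transitions


def markov_probability(seq, model):
    first, transitions = model
    p = first.get(seq[0], 0.0)
    for i in range(len(seq) - 1):
        p *= transitions[i].get((seq[i], seq[i + 1]), 0.0)
    return p


def profile_probability(seq, seqs):
    """Profile score: every position is treated independently."""
    p = 1.0
    for i, c in enumerate(seq):
        p *= sum(s[i] == c for s in seqs) / len(seqs)
    return p


model = train_markov(ACTIVE)
for seq in ACTIVE + INACTIVE:
    print(seq, profile_probability(seq, ACTIVE), markov_probability(seq, model))
```

The profile assigns the same probability (0.25) to all four domains because it cannot see that position 6 depends on position 5, whereas the first-order model gives 0.5 to each active domain and 0 to the inactive ones.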
Order state of HMMs
• First order Markov Model: captures the first order correlation between neighboring nucleotides
• Fifth order Markov Model
• HMM models can use preceding, succeeding or surrounding residues
• There is no real limit on the number of preceding residues that can be used for an HMM (computing time!)
Markov Models take into account additional information about neighboring residues.
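The order of a Markov chain is the number of preceding residues each probability is conditioned on: a fifth-order nucleotide model has 4^5 = 1024 possible contexts, each with its own distribution over the next base. A minimal sketch of training and scoring a k-th order chain follows; the function names and the add-one pseudocount are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def train_kth_order(seqs, k):
    """Count how often each base follows each k-mer context."""
    counts = defaultdict(Counter)
    for s in seqs:
        for i in range(k, len(s)):
            counts[s[i - k:i]][s[i]] += 1
    return counts


def log_likelihood(seq, counts, k, pseudocount=1.0, alphabet="ACGT"):
    """Log-probability of seq[k:] given its first k bases.

    The pseudocount (an assumption of this sketch) keeps unseen
    context/base pairs from getting probability zero.
    """
    total = 0.0
    for i in range(k, len(seq)):
        context = counts.get(seq[i - k:i], Counter())
        numerator = context[seq[i]] + pseudocount
        denominator = sum(context.values()) + pseudocount * len(alphabet)
        total += math.log(numerator / denominator)
    return total


counts = train_kth_order(["ACGTACGTACGT"], k=2)
print(dict(counts["AC"]))
print(log_likelihood("ACGT", counts, k=2), log_likelihood("ACGA", counts, k=2))
```

Higher orders capture longer-range dependencies but need far more training data, since the number of contexts grows as 4^k; computing time and data, not the method itself, set the practical limit.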
Biological applications of HMMs
• Radiation hybrid mapping (Slonim et al., 1997)
• Genetic linkage mapping (Kruglyak et al., 1996)
• Phylogenetic analysis (Felsenstein & Churchill, 1996)
• Protein homology recognition (Karplus et al., 1999)
• Protein secondary structure prediction (Goldman et al., 1996)
• Profile HMM libraries (PROSITE; Pfam database)
• Gene finding (Birney & Durbin, 1997; Henderson, 1997; Krogh, 1997; Lukashin & Borodovsky, 1998)
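Hidden Markov Model
[Figure: an HMM with two states, A (state 1) and B (state 2), and an End state, connected by transitions. State 1 emits a or b with probabilities P1(a) and P1(b), and has a self-transition t1,1 and a transition t1,2 to state 2. State 2 emits a or b with probabilities P2(a) and P2(b), and has a self-transition t2,2 and a transition t2,end to End. The observed symbol sequence is a b a.]
$P(aba \mid \mathrm{HMM}) = P_1(a)\, t_{1,1}\, P_1(b)\, t_{1,2}\, P_2(a)\, t_{2,\mathrm{end}}$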
Hidden Markov Model
[Figure: an HMM drawn as states and transitions: Start -> E (1.0); E -> E (0.9); E -> 5 (0.1); 5 -> I (1.0); I -> I (0.9); I -> End (0.1).]
Emission probabilities:
E: PA = 0.25, PC = 0.25, PG = 0.25, PT = 0.25
5: PA = 0.05, PC = 0, PG = 0.95, PT = 0
I: PA = 0.4, PC = 0.1, PG = 0.1, PT = 0.4
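The probability of a sequence together with one particular state path is the product of all transition and emission probabilities along that path. The sketch below computes it in log space for the three-state model above (E, 5, I). The transition and emission values are those shown on the slide; the 26-nucleotide test sequence and the function name are assumptions added for illustration.

```python
import math

TRANS = {
    ("Start", "E"): 1.0,
    ("E", "E"): 0.9, ("E", "5"): 0.1,
    ("5", "I"): 1.0,
    ("I", "I"): 0.9, ("I", "End"): 0.1,
}
EMIT = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "5": {"A": 0.05, "C": 0.0, "G": 0.95, "T": 0.0},
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def log_joint(seq, path):
    """log P(seq, path | HMM); -inf if the path is impossible."""
    if len(seq) != len(path):
        raise ValueError("seq and path must have the same length")
    states = ["Start"] + list(path) + ["End"]
    total = 0.0
    for prev, cur in zip(states, states[1:]):
        t = TRANS.get((prev, cur), 0.0)
        if t == 0.0:
            return -math.inf
        total += math.log(t)
    for state, symbol in zip(path, seq):
        e = EMIT[state][symbol]
        if e == 0.0:
            return -math.inf
        total += math.log(e)
    return total


# Illustrative sequence (an assumption, not taken from the slides).
seq = "CTTCATGTGAAAGCAGACGTAAGTCA"
path = "E" * 18 + "5" + "I" * 7
print(round(log_joint(seq, path), 2))
```

Working with logarithms avoids numerical underflow, since the product of many probabilities below 1 quickly becomes too small to represent directly.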
Hidden Markov Model
Markov Models assume that sequences are generated independently of the model
Applied to time series or to linear sequences
Basic Architecture of a profile HMM
[Figure: a profile HMM running from Start to End through the match states m1, m2 and m3 (with C, C and Y as their most probable symbols), the insert states i0 to i3 and the delete states d1 to d3. Each match state carries a probability distribution over the amino acids A, C, D, E, F, G, H, I, ..., Y; the example probabilities shown are 0.3, 0.015, 0.06, 0.01, 0.5 and 0.01.]
Match states model the distribution of symbols in the corresponding column of an alignment; the probabilities are derived from the information contained in the alignment.
Methods in Gene Prediction
Ab initio analysis of genomic sequences:
• Genscan (Burge and Karlin 1997)
• HMMer (Haussler et al. 1993, Krogh et al. 1994)
• FGenesH (Solovyev and Salamov 1994)
Comparison of protein and genomic sequences:
• Procrustes (Gelfand et al. 1996)
• Genewise (Birney and Durbin)
Cross-species genomic sequence comparisons:
• CEM (Bafna and Huson 2000)
• TWINSCAN (Korf et al. 2001)
• Doublescan (Meyer and Durbin 2002)
• SLAM (Alexandersson et al. 2003)
Gene prediction programs (many with homology searching capabilities)
GeneMachine http://genome.nhgri.nih.gov/genemachine
Genscan http://genome.dkfz-heidelberg.de
GenomeScan http://genes.mit.edu/genomescan
Fgenesh, Fgenes-M, TSSW, TSSG, Polyah, SPL and RNAS http://genomic.sanger.ac.uk/gf/gf.shtml
Fgenesh, Fgenes-M, SPL and RNASPL http://www.softberry.com/berry.phtml
HMMgene http://www.cbs.dtu.dk/services/HMMgene
Genie http://www.fruitfly.org/seq_tools/genie.html
GeneMark http://www.ebi.ac.uk/genemark
GeneID http://www1.imim.es/software/geneid/geneid.html#top
GeneParser http://beagle.colorado.edu/~eesnyder/GeneParser.html
MZEF and POMBE http://argon.cshl.org/genefinder/
AAT, MZEF with homology http://genome.cs.mtu.edu/aat.html
MZEF with SpliceProximalCheck http://industry.ebi.ac.uk/~thanaraj/MZEF-SPC.html
Genesplicer, Glimmer and GlimmerM http://www.tigr.org/~salzberg
WebGene http://www.itba.mi.cnr.it/webgene
GenLang http://www.cbil.upenn.edu/genlang/genlang_home.html
Xpound ftp://igs-server.cnrs-mrs.fr/pub/Banbury/xpound
Gene-prediction programs: alignment based
Procrustes http://www-hto.usc.edu/software/procrustes/index.hl
GeneWise2 http://www.sanger.ac.uk/Software/Wise2
SplicePredictor http://bioinformatics.iastate.edu/cgi-bin/sp.cgi
PredictGenes http://cbrg.inf.ethz.ch/Server/subsection3_1_8.html
Gene-prediction programs: comparative genomics
Doublescan http://www.sanger.ac.uk/Software/analysis/doublescan
SLAM http://bio.math.berkeley.edu/slam
Twinscan http://genes.cs.wustl.edu
Finding ORFs and splice sites
DioGenes http://www.cbc.umn.edu/diogenes/index.html
OrfFinder http://www.ncbi.nlm.nih.gov/gorf/gorf.html
YeastGene http://tubic.tju.edu.cn/cgi-bin/Yeastgene.cgi
CDS: search coding regions http://bioweb.pasteur.fr/seqanal/interfaces/cds-simple.html
Neural network splice site prediction http://www.fruitfly.org/seq_tools/splice.html
NetGene2 http://www.cbs.dtu.dk/services/NetGene2
RNA gene prediction
tRNAScan http://www.genetics.wustl.edu/eddy/tRNAscan-SE/
• Victor Solovyev and colleagues
• FGENE applications are based on HMMs
• They form a complete, partially automated, modular package
• Dynamic modelling with various features of coding sequences
• Precise determination of exon borders with homology search
FGENES, FGENEH, FGENESH(+)
[Figure: prediction for the model gene (TATA box at 1011; positions 1066, 1345, 2427 and 3058 marked).]
GENIE (UCLA)
• Combination of statistical methods (HMM) and neural networks
• A candidate sequence is "threaded" through the HMM using a min-cost path search algorithm, and the system reports this "optimal" path as the predicted gene structure.
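The "min-cost path" idea can be sketched with a generic Viterbi-style dynamic program over a toy two-state HMM. This is only an illustration under stated assumptions: the states, the probabilities and the function name `min_cost_path` are invented here and are not Genie's actual model, which is far richer and also uses neural-network signal sensors. Costs are negative log probabilities, so the cheapest path is the most probable one.

```python
import math

STATES = ("intergenic", "coding")

# Toy parameters, invented for illustration (not Genie's real model).
START = {"intergenic": 0.9, "coding": 0.1}
TRANS = {
    "intergenic": {"intergenic": 0.9, "coding": 0.1},
    "coding": {"intergenic": 0.1, "coding": 0.9},
}
EMIT = {
    "intergenic": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}


def cost(p):
    """Negative log probability: low cost means high probability."""
    return -math.log(p)


def min_cost_path(seq):
    """Return (total cost, state path) of the cheapest path through the HMM."""
    # best[s] = (cost of the cheapest path ending in state s, that path)
    best = {s: (cost(START[s]) + cost(EMIT[s][seq[0]]), [s]) for s in STATES}
    for base in seq[1:]:
        new_best = {}
        for s in STATES:
            # Cheapest predecessor state for arriving in s at this position.
            prev = min(STATES, key=lambda p: best[p][0] + cost(TRANS[p][s]))
            total = best[prev][0] + cost(TRANS[prev][s]) + cost(EMIT[s][base])
            new_best[s] = (total, best[prev][1] + [s])
        best = new_best
    return min(best.values(), key=lambda item: item[0])


total_cost, path = min_cost_path("ATATATGCGCGCGCATATAT")
print(path)
```

With these toy numbers the G+C-rich block in the middle is labelled "coding" and the A+T-rich flanks "intergenic", because the gain in emission probability over eight bases outweighs the cost of two state switches.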
GrailEXP
• Widely used for GenBank annotations
• GrailEXP predicts exons, genes, promoters, polyAs, CpG islands, EST similarities, and repetitive elements
Genscan (Chris Burge and Samuel Karlin)
• Genscan employs a dynamic programming strategy.
• General three-periodic (inhomogeneous) fifth-order Markov model.
• Transcription, translation and splicing signals.
• Length distributions and compositional features of introns, exons and intergenic regions.
• Distinctive feature: it was developed to recognize partial and multiple genes on both strands.
• Independent of databases.
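"Three-periodic" means that the model keeps a separate set of conditional probabilities for each of the three codon positions. The sketch below illustrates that idea only; it is not Genscan's code. It assumes a first-order model instead of fifth-order to stay small, uses an invented training sequence, and the names `train` and `log_likelihood` are hypothetical.

```python
import math
from collections import defaultdict

BASES = "ACGT"
ORDER = 1  # Genscan uses order 5; order 1 keeps the toy example small.


def train(coding_seqs):
    """Count base given (codon position, preceding ORDER bases), in frame."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in coding_seqs:
        for i in range(ORDER, len(seq)):
            counts[(i % 3, seq[i - ORDER:i])][seq[i]] += 1
    return counts


def log_likelihood(seq, counts, frame):
    """Log-likelihood of seq if its first base sits at codon position `frame`."""
    total = 0.0
    for i in range(ORDER, len(seq)):
        ctx = counts[((i + frame) % 3, seq[i - ORDER:i])]
        # Add-one smoothing, so an unseen context gives probability 1/4.
        p = (ctx[seq[i]] + 1) / (sum(ctx.values()) + len(BASES))
        total += math.log(p)
    return total


model = train(["ATGGCTGCTGCTGCTGCTTAA"])  # invented training sequence
scores = {f: log_likelihood("GCTGCTGCTGCT", model, f) for f in range(3)}
best_frame = max(scores, key=scores.get)
print(best_frame, scores)
```

Scoring the same sequence in all three frames and keeping the best one is how such a model separates the true reading frame from the two shifted ones. A homogeneous Markov model, with one table for all positions, could not make that distinction.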
TWINSCAN (I. Korf et al., 2001)
• TWINSCAN models both gene structure and evolutionary conservation
• Scores of features (e.g. splice sites, coding regions) are modified using the patterns of divergence between the target genome and a closely related genome.
Prediction of a subsequence of the mouse genome
[Figure: TWINSCAN and GENSCAN predictions shown against the actual gene structure, with tracks for alignments to human genomic sequences and reported repeat sequences]
GLIMMER (Salzberg and colleagues, JHU)
• Finds genes in microbial DNA.
• Combines Markov models from first through eighth order, weighting each model according to its predictive power.
• Widely used for GenBank annotations.
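The idea of combining Markov models of several orders can be sketched as follows. This is a simplified illustration, not GLIMMER's algorithm: the weighting rule `n / (n + PSEUDO)`, the constant `PSEUDO`, the maximum order of 3 and the function names are all assumptions made for this example. The only point carried over from the slide is that contexts observed often enough to be reliable get more weight than sparse ones.

```python
from collections import defaultdict

BASES = "ACGT"
MAX_ORDER = 3   # GLIMMER goes up to order 8; 3 keeps the toy example small.
PSEUDO = 10.0   # hypothetical constant: how much data a context needs to be trusted


def train(seqs):
    """counts[context][base] for every context length 0..MAX_ORDER."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in seqs:
        for i in range(len(seq)):
            for k in range(min(i, MAX_ORDER) + 1):
                counts[seq[i - k:i]][seq[i]] += 1
    return counts


def interpolated_prob(base, context, counts):
    """Blend orders 0..len(context); well-observed contexts get more weight."""
    prob = 0.25  # uniform fallback below order 0
    for k in range(len(context) + 1):
        ctx = context[len(context) - k:]  # last k bases of the context
        seen = counts.get(ctx, {})
        n = sum(seen.values())
        if n == 0:
            break  # longer contexts cannot have been seen either
        weight = n / (n + PSEUDO)
        prob = weight * seen.get(base, 0) / n + (1 - weight) * prob
    return prob


model = train(["ACGT" * 20])  # invented training data
p_t = interpolated_prob("T", "ACG", model)
total = sum(interpolated_prob(b, "ACG", model) for b in BASES)
print(round(p_t, 3), round(total, 3))
```

Because each step is a weighted average of two probability distributions, the blended values still sum to 1 over the four bases.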
How (not) to use bioinformatics tools
• No single bioinformatics tool is 100% accurate (colleagues and developers may tell you the opposite).
• Common pitfall: for which organism was the application developed?
• Repetitive elements (such as the mouse L1 element) can be wrongly recognized as genes.
• Bioinformatics rule: try several approaches, and try to understand why they may give apparently contradictory results.
Evaluation of Gene Prediction Tools
The ideal test set is a segment of DNA for which all genes have been described experimentally.
Specificity = true predicted / all predicted. Measure for false positives: 9 / 11 = 81.8%
Sensitivity = true predicted / true genes. Measure for false negatives: 9 / 10 = 90%
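The two measures can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative, and the counts (9 correctly predicted genes, 11 predictions, 10 true genes) are the ones from the example above. Note that "specificity" as defined here is what is elsewhere usually called precision or positive predictive value.

```python
def sensitivity(true_predicted: int, true_genes: int) -> float:
    """Fraction of real genes that were found (penalises false negatives)."""
    return true_predicted / true_genes


def specificity(true_predicted: int, all_predicted: int) -> float:
    """Fraction of predictions that are real genes (penalises false positives)."""
    return true_predicted / all_predicted


# Counts from the example: 9 correct predictions, 11 predictions, 10 true genes.
sn = sensitivity(9, 10)
sp = specificity(9, 11)
print(f"Sensitivity = {sn:.1%}")
print(f"Specificity = {sp:.1%}")
```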
Accuracy versus G+C content
[Figure: bar chart of accuracy (y-axis, 0.00 to 1.00) for FGENES, GeneMark, Genie, Genscan, Morgan and MZEF, grouped by G+C content: 0 - 40%, 40 - 50%, 50 - 60%, 60 - 100%]
Source: http://www.cse.ucsc.edu/~rogic/evaluation/tablesgen.html
Exon accuracy
[Figure: bar chart of exon accuracy (y-axis, 0.00 to 1.00) for FGENES, GeneMark, Genie, Genscan, HMMgene, Morgan and MZEF; series: sensitivity (false negatives), specificity (false positives), partially correct predictions]
Source: http://www.cse.ucsc.edu/~rogic/evaluation/tablesgen.html
Accuracy versus exon length
[Figure: bar chart of accuracy (y-axis, 0.00 to 1.00) for FGENES, GeneMark, Genie, Genscan, HMMgene, Morgan and MZEF, grouped by exon length: 0 - 24, 25 - 49, 50 - 74, 75 - 99, 100 - 199, 200 - 299, 300 +]
Source: http://www.cse.ucsc.edu/~rogic/evaluation/tablesgen.html
Accuracy versus exon type
[Figure: bar chart of accuracy (y-axis, 0.00 to 1.00) for FGENES, GeneMark, Genie, Genscan, HMMgene and Morgan, grouped by exon type: initial, internal, terminal, single]
Source: http://www.cse.ucsc.edu/~rogic/evaluation/tablesgen.html
Open Problems and Future Directions
• About 90% of the nucleotides can be identified correctly, but the exact boundaries of exons and their assembly into complete coding sequences are much more difficult to predict. Fewer than half of the genes are predicted exactly.
• The fact that multiple protein products can correspond to a single gene, through alternative splicing, alternative transcription or alternative translation, has not been dealt with effectively.
• Promoter recognition
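The gap between nucleotide-level and exon-level accuracy in the first point can be illustrated with a toy example. The coordinates below are hypothetical and chosen only for illustration, and the function names are not from any particular evaluation package. Two of the three predicted exons miss a boundary by about ten bases.

```python
def covered(exons):
    """Set of nucleotide positions covered by (start, end) exons, inclusive."""
    return {pos for start, end in exons for pos in range(start, end + 1)}


def nucleotide_sensitivity(true_exons, predicted_exons):
    """Fraction of true coding nucleotides that fall inside a predicted exon."""
    true_nt = covered(true_exons)
    return len(true_nt & covered(predicted_exons)) / len(true_nt)


def exact_exon_sensitivity(true_exons, predicted_exons):
    """Fraction of true exons predicted with both boundaries exactly right."""
    return len(set(true_exons) & set(predicted_exons)) / len(true_exons)


# Hypothetical coordinates, chosen only for illustration.
true_exons = [(100, 199), (300, 399), (500, 599)]
predicted_exons = [(100, 199), (310, 399), (500, 590)]

nt_sn = nucleotide_sensitivity(true_exons, predicted_exons)
exon_sn = exact_exon_sensitivity(true_exons, predicted_exons)
print(f"Nucleotide level: {nt_sn:.1%}")
print(f"Exact exons:      {exon_sn:.1%}")
```

Here 281 of 300 coding nucleotides are recovered, yet only one of the three exons is exactly right, and the gene as a whole would count as mispredicted.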