9. lecture ws 2003/04bioinformatics iii1 gene finding material of this lecture taken from - chapter...

9. Lecture WS 2003/04

Bioinformatics III 1

Gene finding

Material of this lecture taken from

- chapter 8, DW Mount „Bioinformatics“

- C. Mathé et al. Nucleic Acids Research 30, 4103 (2002)

Recent years have seen an explosion of the number of completed genome

sequences.

The Genome Online Database lists

166 published completed genomes in the public databases

415 ongoing prokaryotic sequencing projects

360 ongoing eukaryotic sequencing projects.

However, annotation (biological interpretation) is hardly keeping pace with the

new flow of data :-(



Gene finding

- sequence motifs useful for gene finding

- prokaryotic gene finding

- Hidden Markov Models (HMM)

- eukaryotic gene finding

- popular gene finding programs, comparison



Introduction

The simplest method of finding DNA sequences that encode proteins is to

search for open reading frames (ORFs).

In each sequence, there are 6 possible open reading frames:

3 ORFs starting at positions 1, 2, and 3, and going in the 5‘ to 3‘ direction

and 3 ORFs starting at positions 1, 2, and 3, and going in the 5‘ to 3‘ direction of

the complementary sequence.

In prokaryotic genomes, DNA sequences encoding proteins are transcribed into

mRNA, and the mRNA is usually directly translated into proteins without

significant modification.

Therefore, the longest ORF running from the first available Met codon (AUG) on

the mRNA to the next stop codon in the same reading frame, is usually a good

prediction of the protein-encoding regions.



Methods

Obtain new genomic DNAsequence.

Translate in all6 ORFs and compareto protein database.

Perform database similarity search ofEST database of sameorganism, or of cDNAsequences, if available.

Use gene predictionprogram to locate genes.

Analyze regulatorysequences in the gene.



Statistical methods vs. rule-based methods

Statistical methods Rule-based methods

secondary structure prediction:

neural networks apply rules from stereochemistry

AGADIR

tertiary structure prediction:

homology modelling, threading ab initio predictions

gene prediction:

Hidden Markov Models apply rules for gene structure

neural networks

biological function:

sequence analysis predict binding sites + mechanism

structural alignment (3D structures) from 3D structure if available



Statistical methods vs. rule-based methods

Advantages Disadvantages

statistical methods:

powerful if based on large data basis weak if data basis is too small

may be quick to apply

automatic training procedures no insight in underlying principles

rule-based methods:

powerful if rules are strongly followed in biological there are rules and

many special cases

insight in underlying principles rules may only become obvious after

problem is solved by statistical method

may be very expensive may be very quick

(e.g. ab initio 3rd structure prediction)



Extrinsic and intrinsic methodsMost approaches now combine

(a) homology methods = „extrinsic methods“ with

(b) gene prediction methods = „intrinsic methods“

Only about half of all genes can be found by homology to other known genes or

proteins (this value is of course increasing as more genomes get sequenced and

more cDNA/EST sequences get available).

In order to determine the 50% of remaining genes, one has to turn to predictive

methods.

Mathé et al. Nucl. Acids. Res. 30, 4103 (2002)



Prokaryotic genesAdvantages:

there are no introns

the intergenic regions are small

Disadvantages:

the genes may often overlap each other

the translation starts are difficult to predict.




Extrinsic Content SensorsExtrinsic content sensors identify similarities with protein/DNA database entry:

FASTA, BLAST ...

Obvious weakness: nothing will be found if the database does not contain a

sufficiently similar sequence.

Even when a good similarity is found, the limits of the regions of similarity, which

should indicate exons, are not always very precise. Small exons are easily missed.

EST/cDNA data allow to identify exons. On the other hand, EST data gives only

limited information on the gene structure.

Advantage: a single match is enough to detect the presence of a gene.




Eukaryotic genomes

Transcription of protein-encoding regions is initiated at specific promoter

sequences, and followed by removal of noncoding sequence (introns)

from pre-mRNA by a splicing mechanism.

3 types of posttranscriptional events influence the translation of mRNA into

protein and the accuracy of gene prediction:

(1) species-dependent codon usage

(2) tissue-dependent splice variations

(3) mRNA may be edited.



Intrinsic Content Sensors for eukaryotic genomesCharacterize „coding“ regions:

- nucleotide composition

- G+C content (introns are more A/T rich than exons, especially in plants)

- codon composition

- hexamer frequency (this was found to be the most discriminating variable

between coding and non-coding sequences)

- base occurrence periodicity ...

Hexamer frequence, or, more generally, the k-mer composition of coding

sequences is the main search tools in many packages.




Repeated sequence elements in eukaryotic genes

Eukaryotic DNA is wrapped around histone-protein complexes called

nucleosomes. Therefore, some of the base pairs in the major or minor grooves

of DNA face the nucleosome surface and others face outside.

Repeated patterns of sequence have been found in the introns and exons by

hidden Markov model (HMM) analysis.

Patterns appear with a periodicity of 10,

JMB 263, 503 (1996) Baldi et al.



Markov modelA Markov model is a stochastic model which assumes that the probability of

appearance of a given base (A, T, C, or G) at a given position depends only on the k

previous nucleotides.

k is called the order of the Markov model.

Such a model is defined by the conditional probabilities P(X|k previous nucleotides),

where X = A, T, G, or C.

In order to build a Markov model, a learning set of sequences on which these

probabilities will be estimated is required.

Given a sequence and a Markov model, one can then very simply compute the

probability that this sequence has been generated according to this model.




Markov modelsThe simplest Markov models are homogenuous zero order Markov models which

assume that each base occurs independently with a given frequency.

Such simple models were previously often used for non-coding regions.

Modern programs like GeneMark also use higher order models to represent introns

and intergenic regions.

More complex three-periodic Markov models have been introduced to

characterize coding sequences. Coding regions are defined by three Markov

models, one for each position inside a codon.




Markov modelsThe larger the order of a Markov model, the finer it can characterize dependencies

between adjacent nucleotides.

However, a model of order k requires a large number of coding sequences to be

reliably estimated.

Therefore, most programs like GeneMark or Genscan rely on a three-periodic

Markov model of order five (thus exploiting hexamer composition) or less to

characterize coding sequences.




Hidden Markov Models

An HMM is a graph of connected states, each

state potentially able to “emit” a series of

observations. The process evolves in some

dimension, often time, though not necessarily.

The model is parameterized with probabilities

governing the state at a time t + 1, given that

one knows the previous states. Markov

assumptions are used to truncate the

dependency of having to know the entire history

of states up to this point in order to assess the

next state: Instead, only one step back is

required. As the process evolves in time through

the states, each state can potentially emit

observations, which are regarded as a stream of

observations over time. These models are often

illustrated graphically as shown right, with the

states being circles and transitions as arrows

between the states.



Signal sensors

The basic and natural approach to finding a signal that may represent the

presence of a functional site is to search for a match with a consensus

sequence e.g. for promoter regions (TATA box), or the ribosomal binding site on

the mRNA.

This consensus could be determined from a multiple alignment of functionally

related documented sequences. Programs SPLICEVIEW and

SplicePredictor.

A more flexible representation of signals is offered by the so-called positional

weight matrices (PWMs): indicate the probability that a given base appears at

each position of the signal (again computed from a multiple alignment of

functionally related sequences).

One can say that a PWM is defined by one classical zero order Markov model

per position.

The PWM weights can also be optimized by a neural network method.




Signal sensors

In order to capture possible dependencies between adjacent positions of a signal,

one may use higher order Markov models called weight array models (WAM).

These methods assume a fixed length signal.

Hidden Markov models further allow for insertions and deletions.

Most existing programs use such models to represent and detect

- splice sites

- branch points

- correct intron/exon boundaries

and other motives like

- poly(A) sites (in 3‘-UTRs)

- promoters




Predict eukaryotic gene structures

One doesn‘t want to only search for independent exons, but instead identify the

whole complex structures of genes!

Each consistent pair of detected signals (translation starts and stops, spice sites)

defines a potential gene region (intron, exon or coding part of an exon).

Since all these potential gene regions can be used to build a gene model, the

number of potential gene models grows exponentially with the number of

predicted exons!

In practice, „correct“ gene structures must satisfy a set of properties

(i) there are no overlapping exons

(ii) coding exons must be frame compatible

(iii) merging two successive coding exons will not generate an in-frame stop at

the junction.

The number of candidates remains, however, exponential.




Testing the reliability of an ORF prediction

(1) Observation of unusual type of sequence variation found in ORFs:

every 3rd base tends to be the same one much more often than by chance alone

(Fickett 1982).

This property is due to nonrandom use of codons in ORFs and is true for any

ORF, regardless of the species.

(2) Determine whether the codons in the ORF correspond to those used in other

genes of the same organism (codon usage statistic).

(3) Translate ORFs into amino acid sequence and compare that to database of

protein sequences. If good hits are found, confidence in new predicted ORF rises.



Neural Network: GRAIL II

Grail II provides analyses of protein-coding regions, poly(A) sites, and promoters

constructs gene models,

predicts encoded protein sequences

provides database searching capabilities.

(1) create list of most likely

exon candidates

(2) evaluate candidates by

neural network

Uberbacher, Mural. PNAS, 88, 11261 (1991)



Glimmer 2.0

Glimmer 2.0 is an extension. Probabi-lities may be computed for any of the k bases preceeding a particular base b, not just those adjacent to b.

Sample ICM (interpolated context model) decomposition tree. The root position 12 has maximum mutual information with the final base position 13. Each child of the root represents the subset of windows with the indicated nucleotide value at position 12, and indicates the maximum mutual information position for that subset. Each node is similarly decomposed into children. Note that children of a single node may represent different base positions.

Salzberg et al. NAR, 27, 4636 (1999)

Glimmer 1.0 uses an interpolated Markovmodel (IMM) where the search depth k is optimized for every tuple.



Accuracy of Glimmer 2.0

Glimmer 2.0's accuracy for ten complete bacterial and archaeal genomes. The

majority of the genes missed were very short, either below the minimum or very

close to it. The default settings produce additional gene predictions ranging

from 7-20% of the total, many of which are likely to be false positives, but some

of which may be genuine. The additional prediction rate drops quickly if the

minimum gene length is set to be greater than 90bp.



Accuracy of Glimmer 2.0

A better measure of accuracy is to consider only "confirmed genes," which we

define as genes that have a significant database match to a gene in another

organism. The table below shows these statistics on 10 genomes for both

Glimmer 1.0 and 2.0. All results were obtained by a very simple training

procedure: Glimmer was trained by first extracting all non-overlapping open

reading frames over 500bp. The trained model was then used to find genes in the

complete genome.



Glimmer 1.0 vs. Glimmer 2.0

Numbers of genes confirmed by database matches found exclusively by

GLIMMER 1.0, by GLIMMER 2.0, and by both systems.



Newly identified genes

Genes in M.tuberculosis found

automatically by GLIMMER

2.0 with homology to protein

sequences from other

organisms.




Problematic gene start predictionDetecting a gene as a protein-coding

ORF with an ‘open’ start still does

not provide full information for gene

annotation. Although several

procedures for gene start prediction

accuracy have been described,

verification of the actual accuracy of

these methods has been hampered

by an insufficient number of

experimentally validated translation

starts and, therefore, a deficit of

reliable data for training and testing.

In the absence of a reliable computer

procedure for gene start prediction,

the rule of the ‘longest ORF’ was

frequently applied to annotate

complete microbial genomes with

gene start assigned to the 5'-most

ATG codon (see Table). Besemer et al. Nucl. Acids. Res. 29, 2607 (2003)




GeneMarkS

Besemer et al. Nucl. Acids. Res. 29, 2607 (2003)



Gene prediction accuracy




Performance of GeneMarkS on eukaryotic genomes




Comparable performance of Glimmer and GeneMarksS




GeneMarkS(A) In the process of GeneMarkS training there

is no division of the coding sequence into two

clusters. However, in applying the

GeneMark.hmm 2.0 program, the model of

coding region derived by GeneMarkS can be

used as the Typical model along with a

heuristic model used as the Atypical model .

(B) In this simplified diagram of hidden state

transitions in GeneMark.hmm 2.0, the state

‘gene’ represents a sequence composed of an

RBS plus a spacer plus the protein-coding

sequence (CDS). Gene overlaps encompass

all possible types of superpositions: overlap of

genes on the same strand (as observed in

operons), overlap of genes on opposite

strands, overlap of coding region with RBS,

and so on. Besemer et al. Nucl. Acids. Res. 29, 2607 (2003)



Positional frequency pattern

Sequence logo representing the RBS positional

frequency pattern detected by GeneMarkS in the

analysis of B.subtilis genomic data. The total

height of the four letters in each position

indicates the position specific information

content, while the height of each letter is

proportional to the nucleotide frequency.




Spacer length(B) Graph of probability

distribution of spacer length,

the sequence between the

RBS sequence and the gene

start.




Gene prediction evaluationVenn diagram showing

group relationships

between the GenBank

annotation and sets of

genes detected by

GeneMark.hmm 2.0 and

Glimmer 2.02 for the

B.subtilis genome (A) and

the E.coli genome (B).




Ribosome binding sitesDistributions of log-odds scores of RBS sites, as

detected by GeneMarkS, in sets of overlapping and

non-overlapping of genes of (A) B.subtilis, (B) E.coli

and (C) M.jannaschii. As can be seen, the overlapping

genes, which are likely to be located inside operons,

frequently have strong RBS sites. Still, most strong

sites of ribosome binding precede the non-overlapping

genes (stand alone genes and genes leading

operons). This tendency is much more apparent in the

case of the archaeal genome of M.jannaschii than in

the E.coli and B.subtilis genomes




RBS motifSequence logo representing the RBS motif

observed in a subset of upstream

sequences of the A.fulgidus genome. This

subset consisted of 50 nt long upstream

sequences overlapping the 3' end of the

preceding gene. The consensus of this

motif is complementary to a section of the

A.fulgidus 16S rRNA.




Spacer lengthDistributions of spacer length for

two species with strong RBS

patterns, B.subtilis and E.coli

(solid and dashed lines,

respectively), and one species

with a strong eukaryotic

promoter-like pattern, A.fulgidus

(dotted line).

The promoter-like pattern of

A.fulgidus is localized much

further upstream of the start

codon than the RBS patterns of

B.subtilis and E.coli.




Distribution of spacer lengths(A) Distribution of spacer lengths observed

in the B.subtilis genome for two different

types of possible RBS hexamers:

AGGAGG and AGGTGA. Multiple

alignment allows these hexamers to be

superimposed. In actual upstream

sequences, these hexamers tend to

occupy different locations relative to the

start codon. This preference may be

involved in the precise positioning of the

ribosome at the translation initiation site

when the 16S rRNA binds to mRNA. The

more frequent hexamer was observed on

average at a further distance from the

gene start than the rare hexamer.




Distribution of spacer lengths(B) Distribution of spacer lengths observed

in the M.thermoautotrophicum genome for

two different types of RBS hexamers:

GGAGGT and GGTGAT. Properties of

these hexamers are similar to the two

hexamers observed in the B.subtilis

genome (A), except that more frequent

hexamer is now found on average at a

closer distance to the gene start than the

rare hexamer.




TIGR: GlimmerM, Exonomy and UnveilGenefinder programs for eukaryotes are far from perfect, often predicting genes

or exons which do not exist, failing to predict those that do exist or generating

predictions having one or more incorrect exon boundaries.

Also, different genefinders trained for the same organism often produce different

predictions.

The latest publication by The Institute of Genome Research (TIGR) therefore

suggests 3 genefinders that are all based on one or more types of Markov

models: GlimmerM, Exonomy, Unveil.

Majoros et al. Nucl. Acids. Res. 31, 3601 (2003)



TIGR: GlimmerM, Exonomy and Unveil

Topologies of Unveil Exonomy

283-state HMM 23-state GHMM




Methods don’t work everywhereAn example in which Exonomy

produces the correct gene model.

An example in which GlimmerM

produces the correct gene model.

An example in which Unveil

produces the correct gene model

(as does Genscan).




SummaryThe results of gene prediction should be taken with caution; although the results

are becoming increasingly more reliable, they do remain only predictions.

They are very useful for speeding up gene discovery and knowledge mining.

Biological expertise remains necessary in order to confirm the existence of a virtual

protein and to find or prove its biological function.

Therefore comparative genome approaches are becoming more and more

important where programs can scan for homologies with expressed sequences

(EST or cDNA sequence data).

More work is now focussing on detecting RNA-coding genes.


9. lecture ws 2003/04bioinformatics iii1 gene finding material of this lecture taken from - chapter...

Documents

availablestatistical

predictive methods

eukaryotic gene

popular gene

combinea homology methods

use gene predictionprogram

prokaryotic genomes

prokaryotic genes advantages