Genome analysis 2
Gene identification
Open Reading frame
Six ORFs of dsDNA
Complication with Introns
Genome annotation
•Annotation : Obtaining biological information from unprocessed sequence data
•Structural annotation : Identification of genes and other important sequence elements
•Functional annotation : The determination of the functional roles of genes in the organism
Genome annotation
•Raw genomic sequence can be annotated by:
i. Comparison with databases of previously cloned genes and ESTs
ii. Gene prediction based on consensus features such as promoters, splice sites, polyadenylation sites and ORFs
Gene identification
•Gene finding in eukaryotes is difficult
•Fraction of the genome made up of genes:
 - Bacterial genome : 80-85%
 - Yeast : 70%
 - Fruit fly : 25%
 - Human genome : 3-5%
•In the human genome:
 - Typical exon = 150 bp
 - Intron = several kb
 - Complete gene = hundreds of kb
ORF prediction
•Three reading frames are possible on each strand of a DNA molecule, so a “six-frame translation” is used
 - Result is 6 potential protein sequences
 - The longest frame uninterrupted by a stop codon is usually taken to be the correct one
•Finding the end of an ORF is easier than finding its beginning
•The beginning can be found using:
 - Start codon
 - Kozak sequence (CCGCCAUGG) flanking the start codon
 - CpG islands
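The six-frame translation described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular program: it assumes the standard genetic code, an input containing only A, C, G and T (anything else is translated as X), and the function names are hypothetical. It does not look for start codons, Kozak context or CpG islands.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate codon by codon; '*' marks a stop, a trailing partial codon is dropped."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return the six conceptual translations keyed '+1'..'+3' and '-1'..'-3'."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def longest_open_stretch(protein):
    """Longest run of residues not interrupted by a stop codon."""
    return max(protein.split("*"), key=len)


frames = six_frame_translation("ATGGCCTAA")
for name, protein in frames.items():
    print(name, protein, longest_open_stretch(protein))
```

Picking the frame with the longest open stretch is only a heuristic, which is why start-site evidence is needed as well.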
Software programs for gene identification
•Advantage : Speed – annotation can be carried out concurrently with sequencing itself.
•Disadvantage : Accuracy
•Two strategies used are:
 - Homology searching
 - ab initio prediction
ab initio prediction
Based on the type of algorithm:
•GRAIL – Based on neural networks
 - Predicts exons, genes, promoters, polyA sites, CpG islands, EST similarities and repetitive elements
•GeneFinder – Rule-based system
•GENSCAN, GENIE, HMMGene, GeneMark.hmm, FGENEH – Hidden Markov model
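To give a feel for how a hidden Markov model labels a sequence, here is a toy two-state sketch decoded with the Viterbi algorithm. It is not the model used by GENSCAN or any other program listed above, which have many more states (exons, introns, reading frame, signals). The states, the probabilities and the idea that "coding" DNA is simply GC-rich are all made-up assumptions for illustration.

```python
import math

# Hypothetical two-state model; all numbers are invented for illustration.
STATES = ("noncoding", "coding")
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}


def viterbi(seq):
    """Return the most probable state path for a non-empty A/C/G/T string."""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[best_prev]
                + math.log(TRANS[best_prev][s])
                + math.log(EMIT[s][base])
            )
            pointers[s] = best_prev
        backpointers.append(pointers)
        score = new_score
    # Trace back from the best final state to recover the path.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


print(viterbi("ATATATATGCGCGCGC"))
```

Working in log space avoids numerical underflow on long sequences, and the transition probabilities control how readily the model switches between labels.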
GENSCAN
ab initio prediction
1. Feature-dependent methods. Features of eukaryotic genes that are recognized include:
 - Control signals such as the TATA box, cap site, Kozak consensus and polyadenylation sites
 - HEXON and MZEF are gene-prediction programs that predict only a single feature, the exon
2. A few programs depend on differences in base composition between coding and non-coding DNA
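One simple measure of base composition is GC content in a sliding window. The sketch below is an illustration only: real programs use richer statistics, and the function names, window size and step are arbitrary assumptions.

```python
def gc_content(seq):
    """Fraction of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq, window=100, step=50):
    """Return (start, gc_fraction) for each full window along seq."""
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


print(windowed_gc("AAAAGGGG", window=4, step=4))  # [(0, 0.0), (4, 1.0)]
```

Windows whose composition differs sharply from the genome average are candidates for closer inspection, not confirmed genes.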
ab initio prediction
•Accuracy problem – Algorithms are not 100% accurate
•Errors include:
 - Incorrect calling of exon boundaries
 - Missed exons
 - Failure to detect entire genes
•Solution : Running different programs on a single genome
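One possible way to combine the output of several programs is a per-base vote: keep only the positions that enough programs call as exon. This is an assumed scheme for illustration, not a procedure given in the slides. The interval format (half-open start, end pairs), the function name and the example coordinates are all hypothetical.

```python
from collections import Counter


def consensus_exons(predictions, min_votes):
    """Merge exon calls from several programs by per-base voting.

    predictions: one list of (start, end) half-open intervals per program.
    Returns intervals covered by at least min_votes programs.
    """
    votes = Counter()
    for exons in predictions:
        covered = set()
        for start, end in exons:
            covered.update(range(start, end))
        votes.update(covered)  # each program votes at most once per base
    positions = sorted(p for p, n in votes.items() if n >= min_votes)
    # Join runs of consecutive positions back into intervals.
    merged = []
    for p in positions:
        if merged and merged[-1][1] == p:
            merged[-1][1] = p + 1
        else:
            merged.append([p, p + 1])
    return [tuple(interval) for interval in merged]


program_a = [(100, 250), (400, 520)]
program_b = [(110, 250), (400, 500)]
program_c = [(100, 240)]
print(consensus_exons([program_a, program_b, program_c], min_votes=2))
# [(100, 250), (400, 500)]
```

Raising min_votes trades missed exons for fewer false calls, so the threshold has to be chosen with the error types listed above in mind.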
Homology searching
•Finding genes in long sequences by looking for matches with sequences that are known to be transcribed, e.g. a cDNA, an EST or a known gene
•Programs used are based on BLAST (Basic Local Alignment Search Tool), e.g. BLASTN, BLASTX and BLASTP
Homology searching or ab initio ?
•Algorithms that take similarity data into account are better at gene prediction – Reese et al. (2000), Fortna et al. (2001)
•The latest gene prediction algorithms combine similarity data with ab initio methods, e.g. Grail/Exp, GenieEST, GenomeScan
•tRNAscan-SE : For tRNA identification
Advanced gene finding programs
GLIMMER
•Gene Locator and Interpolated Markov ModelER
•For finding genes in microbial DNA
GeneMark
GenScan