gene finding pipelines for automatic annotation of new eukaryotic and bacterial genomes victor...

Gene finding pipelines for automatic annotation of new eukaryotic and

bacterial genomes

Victor SolovyevProfessor of computer science, Royal Holloway,

University of London

Chairman, Softberry Inc.

New genomes sequencingHuman, Mouse, Rat, Cow, Sheep, Cat, Dog, Pig, Chicken, Drosophila, Bee, Zebrafish, Fugu, Nematodes

Arabidopsis, Rice, Medicago, Soybean, Barley, Poplar, Tomato, Oat, Wheat, Corn

S.cerevisiae, S.pombe, Aspergillus nidulans,Coprinus cinereusCryptococcus neoformans,Fusarium graminearumMagnaporthe griseaNeurospora crassaUstilago maydis

Anopheles, P. falciparum, E. cuniculi, Chlamy, Ciona, Diatom, White rot, P. sojae

Bacterial & bacterial communities

Translation

Enhancer

3’-5’-

Core promoter

Start of transcription

Transcription, 5’-Capping and 3’-polyadenilation

Splicing (removing of intron sequences)

Pre-mRNA

mRNA

Protein

5’-non-coding exon 3’-non-coding

exon

Poly-AsignalInternal exons

Introns

ATG-codon

Stop-codon

Expression stages and structural organization of typical eukaryotic protein-coding gene

Ab initio multiple gene prediction approaches

Probabilistic Pattern recognition Genescan (Burge, Karlin,1997)

HMMgene (Krogh, 1977)

Fgenesh (Salamov, Solovyev,1998)

Genie (Reese et al., 2000)Fgenes (Solovyev,1997)

Likelihoods of genecomponents, HMM

Discriminant functions

Balanced score as productionof likelihoods, simple features

Flexible combinationsof any discriminative features

GeneID (Guigo at al. 1992)

Neural networks

E5

I0 I1 I2

E0 E1 E2

E0

EL

3’-5’-

EF

PolyAPr

N

E3

I5I3

PolyA Pr

E0 5’-3’-

E5 EFEL E3

I0 I1 I2I3I5

E0 E1 E2

Hidden Markov model

of

multiple eukaryotic

genes

Used in

Genescan and Fgenesh

programs

Ei and Ii are different exon and intron states,

respectively (i=0,1,2 reflect 3 possible different ORF).

E 5/3 marks non-coding exons and

I5/I3 are 5’- and 3’-introns adjacent to non-coding exons.

Figure 10. Different functional regions of the first, internal, last and single coding exons corresponding to components of recognition functions. RBS is ribosome binding site, ORF is open reading frame, A and D are acceptor and donor splice sites, respectively, Stop is (TAA, TAG, TGA stop codons).

ORF

ORF

ORF

Stop ATG

Stop A

D A

D ATG ORF

3’-region 5’-region

3’-region intron

intron intron

5’-exon

Internal exon

3’-exon

Single exon

5’-region intron

RBS

Signal differences: start of translation

Signal differences: donor splice site

Importance of good specific parameters: Rice example Fgenesh with Monocot gene-finding parametres

.

Strategy to make gene-finding parametersfor new genomes

1. Using GeneBank genes for close organisms

2. Using new genomic sequence• a) Having known mRNA/cDNA sequences

• Map mRNA by EST_MAP program on genomic sequence

• Extract genes and use them as learning set

• b) Using ab initio gene-prediction

Predict genes, Select genes with protein support

c) Using a database of known proteins (NR) that can be mapped on genome by Prot_map program with reconstructing gene-structure

In addition find protein coding ORF by BESTORF program in a set of ESTs and use them in learning of coding parameters

Learning parameters using GeneBank genes for close organisms

Select GeneBank organism class having enough known genes

Create Infogene database with reconstructed genes running Infog program (some genes might be described in several GeneBank entries)

Run GetGenes program to extract genes from Infogen to use in learning programs

(with cleaning genes with errors in annotation in ORF and splice sites)

Run Efeature program to create:

set of coding regions (usually significantly bigger than set of genes)

set of non-coding regions

Run scripts/programs of learning coding parameters (might be several GC zones)

Run scripts/programs of learning splice sites parameters

Run scripts/programs to create exon length distributions and other statistics

Check parameters of initial probabilities (exons/introns/noncoding)

depending on gene density in genome and gene structure

Test and edit parameters to select the best variant.

Repeat learning on bigger or smaller organism classes and select the best learning set.

Developed parametersfor fgenesh group of programs:

• Human, Mouse, Drosophila, C. elegans, Fish (WUSTL, Baylor, CSHL, JGI)

• Dicots (Arabidopsis), Nicotiana tabacum,

• Monocots (Corn, Rice, Wheat, Barley) (TIGR, Rutgers University)

• Algae, Plasmodium falciparum, Anopheles gambiae

• Schizosaccharomyces pombe, Neurospora crassa,

• Aspergillus nidulans, Coprinus cinereus, Cryptococcus neoformans, Fusarium graminearum, Magnaporthe grisea, Ustilago maydis (MIT/Broad Institute)

• Medicago (University of Minnesota)

• Brugie malayi (TIGR)

FGENESH++: AUTOMATIC EUKARYOTIC GENOME

ANNOTATION PIPELINE1. RefSeq mRNA mapping by Est_map program - mapped genes are excluded

from further gene prediction process.

2. Map all known proteins (NR) on genome by Prot_map program with gene structure reconstruction (find regions occupied by genes)

3. Run Fgenesh+ using mapped proteins and selected genome sequences

4. Run ab initio Fgenesh gene prediction on the rest of genome.

5. Search for protein homologs (by BLAST) of all products of predicted genes in NR.

6. Run Fgenesh+ gene prediction on sequences (from stage 4) having protein homologs.

7. Second run of Fgenesh in regions free from genes selected on stages 1,3,5.

8. Run of Fgenesh gene predictions in large introns of known and predicted genes.

Special variant of FGENESH++ can take into account synteny (human-mouse, for example) using FGENESH-2 program that predicts genes using 2 similar

genomic sequences from different species.

Components of Fgenesh++ automatic pipeline:

Fgenesh – ab initio gene prediction. Run on whole chromosomes (~300MB). FAST: The Human genome of 3 GB sequences is processed for ~ 4 hours

Fgenesh+ This derivative of Fgenesh uses information on homologous proteins to improve accuracy of gene prediction, if such homologs can be found.

Fgenesh-2 Variant of Fgenesh that uses homology between two genomic DNAsequences, such as human and mouse, as an extra factor for more accurate gene prediction.

Fgenesh_C uses information on homologous mRNA/EST to improve accuracy of gene prediction. Can be used to reconstruct alternatively spliced genes.

Gene-finding group of program have mostly common components and working with the same organism-specific

parameters

Components of Fgenesh++ automatic pipeline:

Est_map a program for fast mapping of a set of mRNAs/ESTs to a

chromosome sequence. It takes into account splice site weight matrices for accurate mapping (important for accurate mapping very small exons).

Prot_map is used for fast mapping a database of protein sequences to

genome with accounting for splice sites (useful for genomes with a few known genes and to search for pseudogenes).

Programs for mapping known mRNA/Est or proteins with reconstruction of gene structure

Example of Prot_map mapping of a protein sequence to genomeFirst sequence Chr19 [cut:1 3000000]

[DD] Sequence: 1( 1), S: 52.623, L:1739 IPI:IPI00170643.1|SWISS-PROT:Q8TEK3-1 Summ of block lengths: 1468, Alignment bounds:

On first sequence: start 2146727, end 2167939, length 21213

On second sequence: start 263, end 1739, length 1477 Blocks of alignment: 19

1 E: 2146727 70 [ca GT] P: 2146727 263 L: 23, G: 101.313, W: 1160, S:14.1355

2 E: 2147573 107 [AG GT] P: 2147575 287 L: 35, G: 102.892, W: 1810, S:17.7256

3 E: 2148934 42 [AG GT] P: 2148934 322 L: 14, G: 102.539, W: 720, S:11.1699

4 E: 2150399 111 [AG GT] P: 2150399 336 L: 37, G: 101.777, W: 1880, S:18.0157

5 E: 2150620 235 [AG GT] P: 2150620 373 L: 78, G: 101.251, W: 3930, S:26.0143

6 E: 2151098 114 [AG GT] P: 2151100 452 L: 37, G: 105.778, W: 2000, S:18.7669

7 E: 2151750 92 [AG GT] P: 2151752 490 L: 30, G: 101.188, W: 1510, S:16.1227

8 E: 2153538 102 [AG GT] P: 2153538 520 L: 34, G: 100.414, W: 1690, S:17.0246

9 E: 2153848 138 [AG GT] P: 2153848 554 L: 46, G: 99.168, W: 2240, S:19.5414

10 E: 2154470 126 [AG GT] P: 2154470 600 L: 42, G: 101.071, W: 2110, S:19.0531

11 E: 2156280 485 [AG GT] P: 2156280 642 L: 161, G: 102.616, W: 8290, S:37.9091

12 E: 2156954 136 [AG GT] P: 2156955 804 L: 45, G: 103.244, W: 2340, S:20.1719

13 E: 2157771 147 [AG GT] P: 2157771 849 L: 49, G: 98.511, W: 2360, S:20.0267

14 E: 2160107 115 [AG GT] P: 2160107 898 L: 38, G: 100.777, W: 1900, S:18.0672

15 E: 2161975 584 [AG GT] P: 2161977 937 L: 194, G: 101.031, W: 9740, S:40.932

16 E: 2163280 206 [AG GC] P: 2163280 1131 L: 68, G: 103.135, W: 3530, S:24.7691

17 E: 2165387 65 [AG GT] P: 2165388 1200 L: 21, G: 98.427, W: 1010, S:13.0987

18 E: 2166182 945 [AG GT] P: 2166184 1222 L: 314, G: 102.034, W: 16020, S:52.6232

19 E: 2167736 608 [AG ta] P: 2167738 1538 L: 202, G: 104.624, W: 10730, S:43.3437

Prot_map example of alignment 1 11 2146713 2146723 2146739 2146769

gatcacagaggctgg(..)agtgtctgtgtttca?[GGRIVSSKPFAPLNFRINSRNLSg

...............(..)evdhqlkerfanmke GGRIVSSKPFAPLNFRINSRNLS-

248 248 249 259 267 277

2146797 2146806 2147558 2147568 2147581 2147611

]gtaagaaactctcat(..)ctgtggctcctgcag[acIGTIMRVVELSPLKGSVSWTGK

---------------(..)--------------- -dIGTIMRVVELSPLKGSVSWTGK

286 286 286 286 289 299

2147641 2147671 2147686 2148919 2148926 2148937

PVSYYLHTIDRTI]gtgagtatctcgctg(..)ctttcttctttttag[LENYFSSLKNP

PVSYYLHTIDRTI ---------------(..)--------------- LENYFSSLKNP

309 319 322 322 322 323

2148967 2148982 2150384 2150391 2150402 2150432

KLR]gtaagtttgtgtgtt(..)ctgctctccttccag[EEQEAARRRQQRESKSNAATP

KLR ---------------(..)--------------- EEQEAARRRQQRESKSNAATP

333 336 336 336 337 347

2150462 2150492 2150513 2150523 2150609 2150619

TKGPEGKVAGPADAPM]gtaaggccccagcct(..)ccttgtgtcctccag[DSGAEEEK

TKGPEGKVAGPADAPM ---------------(..)--------------- DSGAEEEK

357 367 373 373 373 373

2150644 2150674 2150704 2150734 2150764 2150794

AGAATVKKPSPSKARKKKLNKKGRKMAGRKRGRPKKMNTANPERKPKKNQTALDALHAQT

AGAATVKKPSPSKARKKKLNKKGRKMAGRKRGRPKKMNTANPERKPKKNQTALDALHAQT

Analysis of gene-finding accuracy and running time

Test on 83 small (< 20 000 bp) human genes using mouse homolog:

Prot_map: Sne= 73.7 Sn_pe- 93.3 Spe- 71.3 Sn_n= 93.9 Sp_n= 88.6 C=0.9015 Time ~ 1 min

Genewise: Sne= 76.4 Sn_pe- 93.9 Spe- 76.4 Sn_n= 94.9 Sp_n=89.4 C=0.9116 Time ~ 90 min

Fgenesh:

Test on 8 big (> 400 000 bp) human genes using mouse homolog:

Prot_map: Sne= 87.9 Sn_pe- 96.0 Spe- 81.3 Sn_n= 94.3 Sp_n= 96.0 C=0.9514 Time ~ 1 min

Genewise: Sne= 91.9 Sn_pe- 97.0 Spe- 90.1 Sn_n= 95.1 Sp_n= 97.0 C=0.9599 Time ~ 1200 min

Fgenesh:

Prot_map mapping of Human protein set

of 55946 proteins on chromosome 19 (~59 MB)

takes 90 min (best hit for each protein) and

148 min (all significant hits for each protein)

Can be used for fast finding of an initial gene set in new genome mapping all known proteins

Used for pseudogenes finding as mapping with frameshifts damaging ORFs

New Fgenesh+ and Genewise1) 700 genes with 6508 exons having similar protein with > 90% similarity

GeneWISE: Sne= 94.1 Sn_pe- 97.8 Spe- 96.0 Sn_n= 98.9 Sp_n= 99.6 C=0.992

FGENESH+: Sne= 96.9 Sn_pe= 98.5 Spe= 97.9 Sn_n= 99.0 Sp_n= 99.5 C=0.992

2) 18 genes with 116 exons having similar Drosophila protein

with identity 28-70%

GeneWISE: Sne= 40.5 Sn_pe- 64.7 Spe- 62.7 Sn_n= 68.3 Sp_n= 99.7 C=0.813

Fgenesh+: Sne= 70.7 Sn_pe= 84.5 Spe= 82.0 Sn_n= 84.8 Sp_n= 96.9 C=0.898

5’-exon Observed - 18 Predicted - 14 Correct - 2 (11 by Fgenesh+)

Intr: Observed - 80 Predicted - 43 Correct - 38 (59 by Fgenesh+)

3’-exon: Observed - 18 Predicted - 14 Correct - 7 (12 by Fgenesh+)

Run time: Fgenesh+ 50 – 1000 times faster than GeneWise

Automated Gene Calling atCenter for Genome Research MIT

Gene structures are predicted using a combination of FGENESH, FGENESH+, and GENEWISE (Sanger Institute). the protein used in the previous had >90% amino acid identity to the translated genome (cumulative across sub-alignments), then the GENEWISE call, if valid, was favored over the FGENESH+ call, and was used as the EVIDENCE_GENE

1. If this protein had >80% but less than 90% amino acid identity to the translated genome (cumulative across sub-alignments), then the FGENESH+ call, if valid, was favored over the GENEWISE call, and was used as the EVIDENCE_GENE

Sequencing: 2003/2004 – 6 new yeast genomes 2004/2005 ~ 20 new yeast genomes

Examples of usage Fgenesh suit in genome annotations

• Grimwood J, Gordon LA, Olsen A, .., Salamov A., Solovyev V., ..., Lukas S. (2004) The DNA sequence and biology of human chromosome 19. Nature, 428(6982), 529-535. Using Fgenesh, Fgenesh+, est_map to annotate genes in Himan cjromosome 19. annotation. · Heiliget al. (2003) The DNA sequence and analysis of human chromosome 14. Nature 421, 601 - 607. FGENESH used for human chromosome 14 annotation. · Hillier et al. (2003) The DNA sequence of human chromosome 7. Nature 424, 157 - 164. Extensive use of FGENESH-2 for human chromosome 7 annotation. Feng et al. (2002) Sequence and analysis of rice chromosome 4. Nature 420, 316 - 320. FGENESH used for annotation of rice chromosome 4.

• Galagan et al. (2003) The genome sequence of the filamentous fungus Neurospora crassa. Nature 422:859-868. Neurospora genome annotation based on FGENESH and FGENESH+. Lander et al. (2001) Initial sequencing and analysis of the human genome. Nature 409, 860 - 921. Original paper on sequencing human genome by public consortium also reports use of FGENESH genefinder for genome annotation. Deloukas et al. (2001) The DNA sequence and comparative analysis of human chromosome 20. Nature 414, 865-871. Use of FGENESH for annotation of human chromosome 20. Yu et al. (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296:79-92. Rice genome sequencing and annotation project used FGENESH as primary source of gene predictions.

Holt et al. (2002) The Genome Sequence of the Malaria Mosquito Anopheles gambiae. Science 298: 129-149. Use of FGENESH for annotation of Anopheles genome.

Canonical and Non-canonical splice sites

SpliceDB (Burset, Seledtsov, Solovyev, NAR 1999,2000)

Gene prediction is usually done with only standard splice sites

a) GT-AG group (canonical splice sites): 22199 examples

M70A60G80|GTR95A71G81T46

Y73Y75Y78Y79Y80Y79Y78Y81Y86Y86NC71AG|G52

b) GC-AG group: 126 examples

M83A89G98|GCA87A84G97T71

c) AT-AC group: 8 annotated examples + 2 examples recovered from annotation errors

S90|ATA100T100C100C100T100T90T70

T70G50C70NC60AC|A60T60

GT-AG: 99.24% GC-AG: 0.69% AT-AC: 0.05% other sites: 0.02%

Additional sources of genes

• Identified with synteny data help

• Non canonical splice sites

• Alternatively spliced

• Alternative promoters

• Alternative poly-A

Additional studies of the above topics will update the current gene collections

Exon-based syntheny

1. Run Gene-finding annotation pipeline for each genome

2. Select chains of similar exons between 2 genomes comparing coding exons by Blast

95% in agreement with filtered genomealignments

Brudno et al.(2004) Automated Whole-Genome Multiple Alignment of Rat, Mouse, and Human Genome Research Journal, 14(4):685-692.

Pseudogene finding using Prot_map

54408308 54408560 54408568 54408581 54408611 54408641

nnnnnnnn(..)nnnnnnnnnnnnnag[KEFDFESANAQFNKEEMGREFHNKLKLKEDKL

--------(..)--------------- KDFDFESANAQFNKEEIDREFHNKLKLKEDKL

134 134 134 136 146 156

54408671 54408701 54408731 54408761 54408791 54408821

EKEEKPVNGEDKGDSGVDTQNSEGHADEEDALGPNCFYDQTKSSFDNISGDDNRERRPTW

EKQEKPVNGEDKGDSGVDTQNSEGNADEEDPLGPNCYYDKTKSFFDNISCDDNRERRPTW

166 176 186 196 206 216

54408851 54408881 54408911 54408939 54408967 54408976

AEGRRLNAETFGIPLCPNRGHGGYRGRGaGLGFHGGRGRg]gtggcagaagtggta(..)

AEERRLNAETFGIPLRPNRGRGGYRGRG-GLGFRGGRGR- ---------------(..)

226 236 246 255 264 264

Pseudogene finder

• Generation pseudogene candidates:• Run script finding genes having almost identical

coding proteins (or part of them) with lesser number introns (or without introns).

• Run prot_map mapping Human (mammalian) proteins and selecting damaged ones

• Selecting pseudogenes using additional features: like poly_A tail, ratio ks/kn

Table 7. Selected characteristics of promoter sequences used by TSSW programs for TATA+

and TATA- promoters.

Characteristiscs D2 for TATA+ promoters D2 for TATA- promoters

•Hexaplets -200 - -45 2.6 1.4 (-100 - -1)

•TATA box score 3.4 0.9

•Triplets around TSS 4.1 0.7

•Hexaplets +1 - +40 0.9

•Sp1-motif content 0.9

•TATA fixed location 0.7

•CpG content 1.4 0.7

•Similarity -200 - -100 0.3 0.7

•Motif Density(MD) -200 - +1 4.5 3.2

•Direct/Inverted MD-100 - +1 4.0 3.3 ( -100 - -1)

Total Maxalonobis distance 11.2 4.3

Number promoters/non-promoters 203/4000 193/74000

Development of eukaryotic promoter recognizer

In group of TSS programs

Results of promoter search on genes with known mRNAs by different promoter-finding programs. Reproduced from

Liu and States (2002) Genome Research 12:462-469.

Accuracy of prediction by TSSP on plant genomic sequences

Selected known genomic regions upstream of CDS

True positives 92%

Total number of False positives for 40 TATA promoters: 22

(1 per 3648 bp)

True positives 95%

Total number of False positives for 25 TATA –less promoters: 15

(1 per 3300 bp)

For every class (TATA and TATA-less) promoters only one predicted TSS with highest score in an interval of 300 bp was taken during the search.

tssp Wed Jul 10 02:52:32 EDT 2002 >gi|1902902|dbj|AB001920.1| Oryza sativa (japonica cultivar-group) gene for phos Length of sequence- 5871 Thresholds for TATA+ promoters - 0.02, for TATA-/enhancers - 0.04 2 promoter/enhancer(s) are predicted Promoter Pos: 1522 LDF- 0.13 TATA box at 1488 18.93 Enhancer Pos: 1597 LDF- 0.12 Transcription factor binding sites/RegSite DB: for promoter at position - 1522 1468 (-) RSP00004 tagaCACGTaga 1459 (+) RSP00010 cACGTG 1456 (+) RSP00011 ctccACGTGgt 1461 (+) RSP00016 caTGCAC 1468 (-) RSP00016 caTGCAC 1256 (-) RSP00026 gcttttgaTGACtTcaaacac 1460 (+) RSP00065 ACGTGgcgc 1460 (+) RSP00066 ACGTGccgc 1459 (+) RSP00069 tACGTG 1341 (+) RSP00071 GACGTC 1346 (-) RSP00071 GACGTC 1452 (-) RSP00096 GGTTT 1432 (+) RSP00129 CACGAC 1281 (+) RSP00148 CGACG 1284 (+) RSP00148 CGACG 1315 (+) RSP00148 CGACG 1335 (+) RSP00148 CGACG 1340 (+) RSP00148 CGACG 1365 (+) RSP00148 CGACG 1434 (+) RSP00148 CGACG 1458 (+) RSP00148 CGACG 1347 (-) RSP00148 CGACG 1474 (+) RSP00162 ACACccGagctaaccacaac 1348 (+) RSP00241 CGGTCA 1387 (+) RSP00339 RTTTTTR 1264 (-) RSP00397 AGTGGCGG 1268 (+) RSP00422 ACCGAC 1459 (+) RSP00423 GACGTG 1464 (-) RSP00424 CACGTC 1369 (-) RSP00431 rdygRCRGTTRs 1278 (-) RSP00432 cVacGGTaGGTgg 1249 (-) RSP00436 TTGACT 1260 (+) RSP00463 atttcatggCCGACctgcttttt 1260 (+) RSP00464 acttgatggCCGACctctttttt 1260 (+) RSP00465 aatatactaCCGACcatgagttct 1265 (+) RSP00466 actaCCGACatgagttccaaaaagc

1 1

45

2 2

2

F i g u r e 2 . L o c a l i z a t i o n o f t h e p r e d i c t e d n e a r e s t T S S i n r e l a t i o n t o t h e k n o w n T S S s f o r

3 5 o f 4 0 g e n e s w i t h t h e a n n o t a t e d T A T A p r o m o t e r s .

-144

-6:-

20

-5:-

1 0

+1:

+5

Gen

es w

ith

pre

di

D i s t a n c e b e t w e e n P r e d i c t e d a n d a n n o t a t e d T S S

cted

TS

S

P r e d i c t e d T S S s r e l a t i v e t o a n n o t a t e d T S S ( 0 )

-37

PromH with ortologous sequences

******* 753 ****** ******

866 GTGGGgCTA-TTTTAaGcaCAGCcTCTtGGcCTgCACactcccctggcccccagCCcCCaGCaGcTCaGCtACTGGTcACCTGCC---------ACCgCCTGGAATGCTGATTGGCAGTTgGct GTGGGaCTAtTTTTAtGtcCAGCtcCTcGGaCTaCAC----------------cCCaCCcCCtGtTCtGCcACTGGTtgCCTGCCtctgcctgtACCtCCTGGAATGCTGATTGGCAGTTaG-- 987 1094 TATA(-26) TSS(+1) 867 ******** 951 GGggtGgGTGGGGGCTGGGAAGACacTaTTATAAAGCtgggAGtG-TtgGGaAGCAGCcGTCcCC-----gTCCaGaGTCCtCTGtggtCC . . .CDS(963) GGctgGaGTGGGGGCTGGGAAGAC-—TgTTATAAAGCctaaAGgGcTaaGGgAGCAGCtGTCaCCtggagcTCCtGcGTCCcCTGcccaCC CDS(1182) 1095 1181 TATA(-28) TSS(-1)

Fgenesb_annotator - Bacterial Gene/Operon

Prediction and Annotation Pipeline FGENESB is a new complex package for annotation of bacterial genomes. Its gene prediction algorithm is based on Markov chain models of coding regions and translation and termination sites. Operon models are based on distances between ORFs, frequencies of different genes neighboring each other in known bacterial genomes, predicted promoters and terminators The parameters of gene prediction are self-learning, so the only input necessary for annotation of new genome is a sequence.

Fgenesb accuracy on difficult sets

Sn (exact Sn (exact+overlapping predictions) predictions) 123set: Glimmer 57.0% 91.1 GeneMarkS 82.9 91.9 FgenesB 89.3 98.4 72set: Glimmer 57.0% 91.7 GeneMarkS 88.9 94.4 FgenesB 91.5 98.6 51set: Glimmer 51.0% 88.2 GeneMarkS 90.2 94.1 FgenesB 92.0 98.0

STEP 1.

Finds all potential ribosomal RRNA genes using BLAST against bacterial and/or archaeal RRNA databases.and masks detected RRNA genes.

STEP 2. Predicts tRNA genes using tRNAscan-SE program.Inside bactg_ann.pl - run tRNAscan-SE and masksdetected TRNA genes .

rRNA and tRNA annotation

STEP 3.Initial predictions of long, slightly overlapping ORF that are used as a starting point for calculating parameters of predictions. Iterates until stabilizes.Generates parameters such as 5th-order in-frame Markov chains for coding regions, 2nd-order Markov models for region around start codon and upstream RBS site, Stop codon and probability distributions of ORF lengths.Protein coding genes prediction

STEP 4.it predicts operons based only on distances between predicted genes.

Genes and Operon identification

STEP 5.Runs blastp for predicted proteins against COG database-cog.pro and annotate by COGs descriptions

STEP 6. Run blastp against NR for proteins having no COGs hitsAnd annotate by NR descriptions.

Annotate genes comparing with databases of known proteins

STEP 7.Uses information about conservation of neighbor gene pairs in known genomes to improve operon prediction.STEP 8.predicts potential promoters (tssb) and terminators (bterm) in the corresponding 5'-upstream and 3'-downstream regions of predicted genes.

Tssb- bacterial promoter prediction (sigma70), using dicriminant function with characteristics of sequence features of promoters (such as conserved motifs, binding sites and etc)

Bterm - prediction of pho-independent terminators as hairpins, with energy scoring based on discriminant function of hairpin elements. STEP 9.refines operon predictions using predicted promoters and terminators as additional evidences.

Promoters and Terminators prediction and improvement of operons assignment

1 1 Op 1 21/0.000 + CDS 407 - 1747 1311 ## COG0593 ATPase involved in DNA + Term 1786 - 1823 3.2 + Prom 1847 - 1906 10.5 2 1 Op 2 3/0.019 + CDS 1926 - 3065 1237 ## COG0592 DNA polymerase + Term 3074 - 3122 9.1 + Prom 3105 - 3164 4.0 3 2 Op 1 4/0.002 + CDS 3193 - 3405 278 ## COG2501 Uncharacterized ACR 4 2 Op 2 4/0.002 + CDS 3418 - 4545 899 ## COG1195 Recombinational DNA 2 Op 3 16/0.000 + CDS 4578 - 6506 2148 ## COG0187 DNA gyrase (topoisomerase II) B subunit + Term 6516 - 6551 4.7 + Prom 6512 - 6571 2.3 6 2 Op 4 . + CDS 6595 - 9066 2957 ## COG0188 DNA gyrase (topoisomerase II) A subunit + Term 9067 - 9098 3.4 + SSU_RRNA 9308 - 10861 100.0 # AY138279 [D:1..1554] # 16S ribosomal RNA # Bacillus cereus + TRNA 10992 - 11068 101.2 # Ile GAT 0 0 + TRNA 11077 - 11152 93.9 # Ala TGC 0 0 + LSU_RRNA 11233 - 14154 99.0 # AF267882 [D:1..2922] # 23S ribosomal RNA # Bacillus 7 3 Op 1 . - CDS 14175 - 14363 158 + 5S_RRNA 14205 - 14315 97.0 # AE017026 [D:165635..165750] # 5S ribosomal RNA # Bacillus 8 3 Op 2 . - CDS 14353 - 15249 351 ## Similar_to_GB 9 3 Op 3 . - CDS 15170 - 15352 99 - Prom 15373 - 15432 6.9

Fgenesb_annotator output:

Comparison of 2 bacterial genomes

GenomMatch aligns 2 bacterial genomes 2 MB x 2MB ~ 30 sec

Figure1

Nature (2004) 428 (6978) , p. 37 – 43

Annotation of new bacteriaNew drugs

Annotation of bacterial communities DNA fromSpecific sources

(not growing in Labs)Oceans/Acid mines/agriculture

(with mix of 100s species)

New ferments

• Main Collaborators:

• Asaf Salamov, Igor Seledtsov,

• Ilham Shahmuradov

gene finding pipelines for automatic annotation of new eukaryotic and bacterial genomes victor...

Documents

run fgenesh

run of fgenesh

known genes

genebank genes

fgenesh gene predictions

reconstructed genes

cleaning genes

learning of coding parameters