Transcript
Page 1: Eukaryotic Genome Annotation

1 Bioinformatics & Evolutionary Genomics Division, Plant Systems Biology, VIB/Ugent, Technologiepark 927, B-9052 Gent, Belgium

2 INRA-associated to Bioinformatics & Evolutionary Genomics Division, Plant Systems Biology, VIB/Ugent, Technologiepark 927, B-9052 Gent, Belgium

E-mail: [email protected] URL: http://bioinformatics.psb.ugent.be/

Eukaryotic Genome Annotation

Lieven Sterck1, Stéphane Rombauts1, Jeffrey Fawcett1, Yao-Cheng Lin1, Steven Robbens1, Jan Wuyts1, Francis Dierick1, Pierre Rouzé2 and Yves Van de Peer1

References

1: Schiex T, Moisan A, and Rouzé P. (2001) EuGène: An Eukaryotic Gene Finder that combines several sources of evidence. Computational Biology, Eds. O. Gascuel and M-F. Sagot, LNCS 2066, pp. 111-125

2: Tuskan et al. (2006) The genome of black cottonwood, Populus trichocarpa (Torr. & Gray ex Brayshaw). Science 313, 1596-1604

3: Derelle et al. (2006) Genome analysis of the smallest free-living eukaryote Ostreococcus tauri unveils many unique features. Proc. Natl. Acad. Sci. USA 103, 11647-11652

This work is supported by the European Commission (QLRI-CT-2001-00006)

Gene prediction and genome annotation have always been among the main research topics of our group. Over the past years we have demonstrated the strength of our annotation platform and built a reputation in the field through a number of collaborative efforts to annotate newly sequenced plant genomes. Although we are still involved in several annotation projects for higher plants, we are increasingly asked to produce automatic genome annotations for a broader diversity of eukaryotic genomes, such as those of fungi and algae.

Introduction

Raw sequence data is of little direct use to biologists. To be meaningful, it has to be converted into biologically significant knowledge: markers, genes, RNAs, protein sequences. Genome annotation is the first step toward acquiring this knowledge.

A thorough annotation must take into account:

• similarities with known sequences (proteins, ESTs, other genomes,…)

• region content analysis

• signal prediction software (ATG, splice sites)

• integrated prediction tools (GenScan, FgenesH, …)

• all available significant biological knowledge

[Figure: from genomic sequence to predicted genes. Labels: genomic sequence (ATCCGTAAGATGGTGCGATGCCC...); intrinsic approaches; extrinsic approaches; RepeatMasker, Blastn, Blastx; EuGene; predicted genes (structural annotation), e.g. join(9265..9395,9749..99342) and complement(join(10164..10295,10349..10420,10467..10514,10566..10626,10681..10770,10823..10949,11001))]
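The predicted gene structures in the figure are written in feature-location notation (join(...), complement(...)). As an illustration of how such a string maps to exon coordinates, here is a minimal Python sketch. The function names parse_location and spliced_length are hypothetical, and the parser assumes only plain start..end ranges, single positions, join(...) and an outer complement(...); it is not part of EuGene.

```python
import re

def parse_location(location):
    """Parse a location such as 'complement(join(1..5,9..12))'.

    Returns (strand, segments): strand is '+' or '-', and segments is a
    list of (start, end) tuples, 1-based and inclusive, in written order.
    """
    text = location.strip().rstrip(".")  # tolerate a trailing full stop
    strand = "+"
    if text.startswith("complement(") and text.endswith(")"):
        strand = "-"
        text = text[len("complement("):-1]
    if text.startswith("join(") and text.endswith(")"):
        text = text[len("join("):-1]
    segments = []
    for part in text.split(","):
        match = re.fullmatch(r"(\d+)(?:\.\.(\d+))?", part.strip())
        if match is None:
            raise ValueError(f"unsupported location segment: {part!r}")
        start = int(match.group(1))
        # A single position such as '11001' is a one-base segment.
        end = int(match.group(2)) if match.group(2) else start
        segments.append((start, end))
    return strand, segments

def spliced_length(segments):
    """Total number of bases covered by the listed segments."""
    return sum(end - start + 1 for start, end in segments)

strand, exons = parse_location(
    "complement(join(10164..10295,10349..10420,10467..10514))"
)
print(strand, exons, spliced_length(exons))
```

The segments are returned in the order written; for a gene on the complementary strand, the caller would still need to reverse them to obtain transcript order.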

EuGene is developed by T. Schiex and co-workers (INRA-Toulouse, France) in cooperation with our group.

Strengths of EuGene


• EuGene can be specifically adapted to the particularities of newly sequenced genomes, which leads to higher quality predictions

• exploits probabilistic models like Markov models for discriminating coding from non-coding sequences

• integrates information from several signal (splice site, translation start...) prediction software or other 3rd party software

• exploits the wealth of existing sequences (mRNA, 5'/3' EST couples, proteins, genomic homologous sequences)

• integrates each source of information through small independent software components, called "plugins"
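To illustrate how a Markov model can discriminate coding from non-coding sequence, the sketch below trains two fixed-order Markov chains and compares their log-likelihoods for a query. This is a simplified stand-in and not EuGene's own model; the function names, the order of 2, the add-one smoothing and the toy training sequences are all assumptions made for the example.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def train_markov(sequences, order=2):
    """Count how often each base follows each context of `order` bases."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    return {"order": order, "counts": counts}

def log_prob(model, seq):
    """Log-probability of seq under the model, with add-one smoothing."""
    order, counts = model["order"], model["counts"]
    total = 0.0
    for i in range(order, len(seq)):
        context = counts.get(seq[i - order:i], {})
        numerator = context.get(seq[i], 0) + 1
        denominator = sum(context.values()) + len(ALPHABET)
        total += math.log(numerator / denominator)
    return total

def coding_log_odds(coding_model, noncoding_model, seq):
    """Positive values favour the coding model, negative the non-coding one."""
    return log_prob(coding_model, seq) - log_prob(noncoding_model, seq)

# Toy training data, invented for illustration only.
coding_model = train_markov(["ATGGCCGCCGGCGCCGCCGGC"])
noncoding_model = train_markov(["TATATAAATATTTATATAAAT"])

print(coding_log_odds(coding_model, noncoding_model, "GCCGCCGCC"))
print(coding_log_odds(coding_model, noncoding_model, "ATATATAT"))
```

In practice such models would be trained on much larger, species-specific sets of known coding and non-coding sequence, which is one way a predictor can be adapted to a newly sequenced genome.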

The EuGene Annotation Platform

• each base of the genomic sequence is represented individually (nodes)

• weighting, removal and addition of edges according to available information

• shortest path in the graph = a possible gene structure

Based on all the available information, EuGene will output a prediction of maximal score, i.e. maximally consistent with the provided information.
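The graph idea can be sketched with a small dynamic program. In this simplified illustration, each base has one node per state (intergenic, exon, intron), edges connect the states of consecutive bases, and the best path is the one with the highest total score, which is equivalent to a shortest path when the weights are negated. The state set, the allowed edges, the switch penalty and the toy scores are assumptions for the example and do not reproduce EuGene's actual graph.

```python
STATES = ("intergenic", "exon", "intron")

# Edges allowed between the states of two consecutive bases.
ALLOWED = {
    ("intergenic", "intergenic"), ("intergenic", "exon"),
    ("exon", "exon"), ("exon", "intron"), ("exon", "intergenic"),
    ("intron", "intron"), ("intron", "exon"),
}

def best_path(scores, switch_penalty=1.0):
    """Return the highest-scoring state sequence.

    scores: one dict per base, mapping each state to its evidence score.
    The path is forced to start in the intergenic state.
    """
    best = {s: (scores[0][s] if s == "intergenic" else float("-inf"))
            for s in STATES}
    back = []
    for column in scores[1:]:
        new_best, pointers = {}, {}
        for state in STATES:
            # Best predecessor among the states with an edge into `state`.
            candidates = [
                (best[prev] - (switch_penalty if prev != state else 0.0), prev)
                for prev in STATES if (prev, state) in ALLOWED
            ]
            value, prev = max(candidates)
            new_best[state] = value + column[state]
            pointers[state] = prev
        best = new_best
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

# Toy per-base evidence: two intergenic bases, three exon bases, one intergenic.
ig = {"intergenic": 2.0, "exon": 0.0, "intron": 0.0}
ex = {"intergenic": 0.0, "exon": 2.0, "intron": 0.0}
print(best_path([ig, ig, ex, ex, ex, ig]))
```

Removing a pair from ALLOWED, or lowering a per-base score, plays the role of removing or down-weighting edges when evidence argues against a given structure.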

[Figure: EuGene platform. Labels: start sites; splice sites; SpliceMachine; content potential for coding, intron and intergenic (coding IMM, intron IMM, intergenic IMM)]

Schematic representation of the EuGene platform. Depicted above is the basic set-up of EuGene; this scheme can be modified according to the genome to be annotated and the available data.

Information incorporation

Try to automate this as much as possible through the use of annotation platforms.
