VectorBase gene sets

DESCRIPTION

An introduction to the gene sets in VectorBase - how they are made and how to use them

TRANSCRIPT

Page 1: VectorBase gene sets

Slide 1 of 18

VectorBase gene sets

A tutorial

Martin Hammond
VectorBase
European Bioinformatics Institute
December 2008

Page 2: VectorBase gene sets

Slide 2 of 18

What this tutorial covers

• What is a VectorBase gene set?
• How are the gene sets made?
• What problems may there be with the gene models?
• How are the sets affected by the genome assembly?
• Are there different issues for the different organisms?
• Can manual & community input improve gene sets?
• How do gene models get their annotation?

Page 3: VectorBase gene sets

Slide 3 of 18

What is a VB gene set?

VB provides a single ‘official’ gene set for its main species: 3 mosquitoes (Anopheles, Aedes, Culex), plus the tick Ixodes scapularis. VB & its collaborators make the best set we can, initially using automated gene modeling systems.

The same set is the primary gene annotation on the GenBank / EMBL records for the sequence assembly.

Having an official set with stable identifiers makes it easier for the research community to talk to one another about genes. And when improvements are made to a gene set, we keep track of how old & new models are related, so any work on a gene won’t be lost.

Page 4: VectorBase gene sets

Slide 4 of 18

What’s in a VB gene set?

Most people are interested in protein-coding genes.

The gene structures VB presents are predictions (‘models’) built on the whole genome shotgun sequence assembly for the species. Many models will not be exact representations of the ‘real’ gene in a real animal, and some ‘real’ genes may be missing altogether. The reasons for this apply to all genome annotation projects, not just to VB, and are discussed in later sections of this tutorial.

VB also presents non-coding RNA genes (ncRNA) such as tRNAs, miRNAs, snRNAs etc.

Again, these are built on the assembly and are likely to be incomplete sets.

Page 5: VectorBase gene sets

Slide 5 of 18

How are the gene sets made?

• The initial gene set for each species is a collaboration between VB and one or more of the institutes involved in sequencing the genome
• The initial annotation is automated (as opposed to manual) and uses a variety of approaches to find genes & predict their structures
• VB annotation is combined with sets produced by our collaborators using complementary approaches, to produce the initial gene set
• VB then takes over ongoing curation and improvement of the gene set
• VB’s annotation procedure is outlined in the next few slides

Page 6: VectorBase gene sets

Slide 6 of 18

VB automatic annotation: overview

[Flowchart] Raw genome → Repeat masking → Masked genome → five gene sets:

Set 1 Targeted set: Genewise using species-specific protein
Set 2 Arthropod similarity set: Genewise using arthropod proteins
Set 3 EST gene set: Exonerate using species-specific ESTs + combiner
Set 4 Metazoan similarity set: Genewise using all other metazoan proteins
Set 5 ab initio gene set: SNAP + require Pfam domain

→ Merge, giving priority to higher-confidence sets

Page 7: VectorBase gene sets

Slide 7 of 18

VB automatic annotation: Repeat masking

• Several approaches (TRF, Recon, RepeatScout, RepeatMasker) are used to identify & mark repeated sequences in the genome assembly
  – simple repeats, transposable elements etc
• Using this repeat-masked genome sequence helps avoid predicting bad ‘genes’
• The repeat-masked sequence is available at VB from each species’ genome home page
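To illustrate what 'marking' a repeat means, here is a minimal sketch in Python. It assumes the repeat finders have already reported repeat intervals, and it shows the two common masking styles: hard masking (repeat bases replaced by N) and soft masking (repeat bases in lower case). The function name, the toy sequence and the interval are made up for illustration; the slides do not say which style the VB files use.

```python
def mask_repeats(seq, repeats, soft=False):
    """Mask repeat intervals in a sequence.

    repeats: (start, end) pairs, 0-based, end-exclusive (an assumed convention).
    soft=False replaces repeat bases with 'N'; soft=True lower-cases them.
    """
    chars = list(seq)
    for start, end in repeats:
        for i in range(start, min(end, len(chars))):
            chars[i] = chars[i].lower() if soft else "N"
    return "".join(chars)


toy_sequence = "ACGTACGTAC"   # made-up sequence
toy_repeats = [(2, 6)]        # made-up repeat interval

print(mask_repeats(toy_sequence, toy_repeats))             # ACNNNNGTAC
print(mask_repeats(toy_sequence, toy_repeats, soft=True))  # ACgtacGTAC
```

A gene predictor run on the masked sequence cannot build an exon out of the masked bases, which is how masking helps avoid bad 'genes'.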

Page 8: VectorBase gene sets

Slide 8 of 18

VB automatic annotation: Gene sets

• We make genome-wide sets of gene models using 3 main approaches:
  – aligning various sets of protein sequences to the genome using Genewise
  – aligning ESTs and combining them to make ‘EST genes’
    (both these approaches use the Ensembl system)
  – running an ab initio gene predictor called SNAP (from Ian Korf).
• We then combine these sets, prioritizing the higher confidence genes, and adding in lower confidence ones only where there are gaps to be filled (illustrated on next slide).
• We may also combine a protein-based and an EST-based model to produce a protein-based model with its untranslated 5’ & 3’ regions (UTRs)
• The next slide shows how we combine the sets - but be aware that the details are tailored to suit different species

Page 9: VectorBase gene sets

Slide 9 of 18

VB automatic annotation: gap filling

[Diagram: gene models from Sets 1 to 5 aligned along the genome, with the gene set being assembled]

Set 1 Targeted
Set 2 Arthropod
Set 3 EST-based
Set 4 Metazoan
Set 5 Ab initio
Gene set being assembled

The 2 genes from the Targeted set 1 have been placed, and one gene from set 2 can be added into a gap. We will subsequently add single genes from sets 3 & 4, but nothing from set 5.
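The gap-filling idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the VB pipeline itself: each model is reduced to a name and a start/end span, strand is ignored, and a lower-priority model is accepted only if it overlaps nothing already placed. The model names and coordinates are invented so that the outcome mirrors the slide.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    start: int  # inclusive genome coordinate
    end: int    # inclusive genome coordinate


def overlaps(a, b):
    """True if the two spans share at least one base."""
    return a.start <= b.end and b.start <= a.end


def gap_fill(sets_by_priority):
    """Place models set by set, highest-confidence set first.

    A model is added only where it overlaps no model already placed,
    i.e. only where there is a gap to be filled.
    """
    placed = []
    for models in sets_by_priority:
        for model in models:
            if not any(overlaps(model, p) for p in placed):
                placed.append(model)
    return sorted(placed, key=lambda m: m.start)


# Invented coordinates, listed in priority order (set 1 first).
sets_by_priority = [
    [Model("T1", 100, 500), Model("T2", 2000, 2600)],    # set 1: targeted
    [Model("A1", 150, 480), Model("A2", 700, 1100)],     # set 2: arthropod
    [Model("E1", 720, 1000), Model("E2", 1300, 1700)],   # set 3: EST-based
    [Model("M1", 1350, 1650), Model("M2", 2800, 3200)],  # set 4: metazoan
    [Model("S1", 2900, 3100), Model("S2", 400, 750)],    # set 5: ab initio
]

merged = gap_fill(sets_by_priority)
print([m.name for m in merged])  # ['T1', 'A2', 'E2', 'T2', 'M2']
```

As on the slide, both targeted genes are kept, one gene each from sets 2, 3 and 4 fills a gap, and nothing from set 5 is added because both of its models overlap something already placed.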

Page 10: VectorBase gene sets

Slide 10 of 18

Combining annotation from VB and collaborators

• In most of our projects, the initial gene annotation was produced in collaboration with the J Craig Venter Institute (JCVI) &/or the Broad Institute
• Each of the collaborating institutes generated a gene set
• Approaches included EST-based modeling using PASA, Genewise, ab initio programs such as Augustus etc.
• All sets were then merged into one:
  – No alternative transcripts (a limited number were added later in some species)
  – Genes with compatible structures: keep the longest
  – Overlapping genes with different structures: keep the best-supported
  – Where ab initio model only: eliminate short ones unless similar to known protein or domain
  – Re-screen to eliminate CDS from transposable elements
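Three of the merge rules above can be written down as a small Python sketch. The slide does not define 'compatible', 'best-supported' or 'short', so everything specific here is an assumption made for illustration: two models count as compatible when one model's introns are a subset of the other's, support is a single hypothetical evidence score, and the length cut-off is an arbitrary number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    exons: tuple                    # ((start, end), ...) inclusive, sorted by start
    support: int = 0                # hypothetical evidence score
    ab_initio_only: bool = False
    has_homology_hit: bool = False  # similar to a known protein or domain

    @property
    def length(self):
        return sum(end - start + 1 for start, end in self.exons)

    @property
    def introns(self):
        return {(a[1] + 1, b[0] - 1) for a, b in zip(self.exons, self.exons[1:])}


def compatible(a, b):
    # Assumed definition: one model's introns are a subset of the other's.
    return a.introns <= b.introns or b.introns <= a.introns


def choose(a, b):
    """Pick one of two overlapping models."""
    if compatible(a, b):
        return max(a, b, key=lambda m: m.length)   # compatible: keep the longest
    return max(a, b, key=lambda m: m.support)      # otherwise: keep the best-supported


MIN_AB_INITIO_LENGTH = 300  # hypothetical cut-off, in bases


def keep_ab_initio(model):
    """Drop short ab initio-only models unless they have a homology hit."""
    if not model.ab_initio_only:
        return True
    return model.length >= MIN_AB_INITIO_LENGTH or model.has_homology_hit


short_a = Model("a", ((100, 200), (300, 400)), support=2)
long_b = Model("b", ((100, 200), (300, 400), (500, 600)), support=1)
other_c = Model("c", ((100, 250), (300, 400)), support=5)

print(choose(short_a, long_b).name)   # b  (compatible, b is longer)
print(choose(short_a, other_c).name)  # c  (different structure, c is better supported)
```

The remaining rules (no alternative transcripts, re-screening for transposable elements) are left out of the sketch.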

Page 11: VectorBase gene sets

Slide 11 of 18

Limitations of gene sets

• Gene sets made by automated methods will never be perfect!
• Also dependent on quality of the assembly (see next slide)
• Genes may be missed
  – gaps in assembly; lack of EST or protein-homology evidence in the databases
• Genes may be incomplete
  – gaps in assembly; inability to model less-conserved start & end exons
• Merges & splits
  – adjacent genes may occasionally be merged into one model
  – partial support or gaps can lead to one ‘real’ gene being split into two or more models

Page 12: VectorBase gene sets

Slide 12 of 18

Genome assembly issues

• Whole genome shotgun sequencing projects are often assessed on coverage and on number & average size of contigs and supercontigs
  – the VB projects have quite high coverage (bases of sequence generated >6X number of bases in genome)
  – but many gaps are still present in all our assemblies
• Polymorphism problems
  – VB animals are small, and the DNA for sequencing comes from many individuals which may have significant genetic diversity
  – causes assembly problems including artifactual duplications/deletions and missed regions
• Repeat problems
  – Genomes with high levels of repeated sequences are harder to assemble and, in trying to mask the repeats, gene families can occasionally be masked
• Remember, as well as the assembly, the raw traces (sequence reads) are also available and can be searched.

Page 13: VectorBase gene sets

Slide 13 of 18

Gene set comparisons, December 2008

Species   | Assembly length | # supercontigs                               | Predicted gene models
Ixodes    | 1.77 Gb         | 369,495                                      | 20,486
Anopheles | 280 Mb          | 8,987 (4,654 ordered on 5 chromosome arms)   | 12,945
Aedes     | 1.38 Gb         | 4,758                                        | 15,419
Culex     | 580 Mb          | 3,171                                        | 18,883

Can you conclude that Ixodes & Culex really have more genes? No - they might, but the number of predictions depends on the state of the assembly and gene annotation as well. For example, the number of predicted genes for Anopheles decreased in the first revisions as bad predictions were eliminated, and is now set to increase again as a result of detailed manual annotation.

Page 14: VectorBase gene sets

Slide 14 of 18

Issues for different species

By now you will be aware that all gene sets, including those at VB, need to be used with a degree of caution. Here are a few additional points for each of the VB species, emphasizing how they differ.

Anopheles gambiae

The only assembly where scaffolds have been assigned to chromosomes; known polymorphism issues partially addressed; gene set now in its fourth version; gene set includes much manual annotation.

Aedes aegypti

Genome much larger & with higher repeat content than the other mosquitoes.

Culex quinquefasciatus

Higher gene count may reflect some real family expansion but may also be some overprediction.

Ixodes scapularis

Large genome; high level of polymorphism leading to assembly with many gaps and large number of separate supercontigs. Gene set expected to be missing genes and to include models that may be incomplete.

Page 15: VectorBase gene sets

Slide 15 of 18

Manual & community input can improve gene sets

• Automatic annotation can be applied to a whole genome relatively rapidly and although it has limitations, these can be taken into account when making use of the gene set.
• Expert manual annotation can improve the structures of individual genes, but is a slow process
  – VB has carried out some systematic manual annotation - mostly on Anopheles so far
  – VB has also done targeted manual annotation leading to correction of some models in all 3 mosquito species
  – Community annotation for individual genes is welcomed and can be submitted via our Community Annotation system - read more in the tutorial here:
    http://www.vectorbase.org/Help/VectorBase_tutorials

Page 16: VectorBase gene sets

Slide 16 of 18

Anopheles browser showing manually-annotated models on chromosome arm 2R

The manual annotator suggests merging 2 existing models and changing the structure of another. These changes will be incorporated into build 5 of the Anopheles gene set.

Page 17: VectorBase gene sets

Slide 17 of 18

Adding annotations to gene models

• VB adds value by automatically annotating features of gene models
• Protein features, including:
  – transmembrane regions & signal peptides
  – families (Prints, TIGRFam etc)
  – domains (Pfam etc)
• Cross references to other resources
  – database records that may represent the same gene
  – GO terms
• Community annotations are also welcomed - see the guide here:
  http://www.vectorbase.org/Help/Community_Annotation:Submission_User_Guide

Page 18: VectorBase gene sets

Slide 18 of 18

Further information and help

VectorBase help documentation starts at http://www.vectorbase.org/Help/Main_Page

Please email the VectorBase help desk with any further comments or questions. The address is: [email protected]