VectorBase gene sets
DESCRIPTION
An introduction to the gene sets in VectorBase - how they are made and how to use them
TRANSCRIPT
Slide 1 of 18
VectorBase gene sets
A tutorial
Martin Hammond
VectorBase
European Bioinformatics Institute
December 2008
Slide 2 of 18
What this tutorial covers
• What is a VectorBase gene set?
• How are the gene sets made?
• What problems may there be with the gene models?
• How are the sets affected by the genome assembly?
• Are there different issues for the different organisms?
• Can manual & community input improve gene sets?
• How do gene models get their annotation?
Slide 3 of 18
What is a VB gene set?
VB provides a single ‘official’ gene set for its main species: 3 mosquitoes (Anopheles, Aedes, Culex), plus the tick Ixodes scapularis. VB & its collaborators make the best set we can, initially using automated gene modeling systems.
The same set is the primary gene annotation on the GenBank / EMBL records for the sequence assembly.
Having an official set with stable identifiers makes it easier for the research community to talk to one another about genes. And when improvements are made to a gene set, we keep track of how old & new models are related, so any work on a gene won’t be lost.
Slide 4 of 18
What’s in a VB gene set?
Most people are interested in protein-coding genes.
The gene structures VB presents are predictions (‘models’) built on the whole genome shotgun sequence assembly for the species. Many models will not be exact representations of the ‘real’ gene in a real animal, and some ‘real’ genes may be missing altogether. The reasons for this apply to all genome annotation projects, not just to VB, and are discussed in later sections of this tutorial.
VB also presents non-coding RNA genes (ncRNA) such as tRNAs, miRNAs, snRNAs etc.
Again, these are built on the assembly and are likely to be incomplete sets.
Slide 5 of 18
How are the gene sets made?
• The initial gene set for each species is a collaboration between VB and one or more of the institutes involved in sequencing the genome
• The initial annotation is automated (as opposed to manual) and uses a variety of approaches to find genes & predict their structures
• VB annotation is combined with sets produced by our collaborators using complementary approaches, to produce the initial gene set
• VB then takes over ongoing curation and improvement of the gene set
• VB’s annotation procedure is outlined in the next few slides
Slide 6 of 18
VB automatic annotation: overview
[Flow diagram: Raw genome -> Repeat masking -> Masked genome -> Sets 1 to 5 -> Merge, giving priority to higher-confidence sets]
Set 1, Targeted set: Genewise using species-specific proteins
Set 2, Arthropod similarity set: Genewise using arthropod proteins
Set 3, EST gene set: Exonerate using species-specific ESTs + combiner
Set 4, Metazoan similarity set: Genewise using all other metazoan proteins
Set 5, ab initio gene set: SNAP + require Pfam domain
Slide 7 of 18
VB automatic annotation: Repeat masking
• Several approaches (TRF, Recon, RepeatScout, RepeatMasker) are used to identify & mark repeated sequences in the genome assembly
  • simple repeats, transposable elements etc.
• Using this repeat-masked genome sequence helps avoid predicting bad ‘genes’
• The repeat-masked sequence is available at VB from each species’ genome home page
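To make the idea of "marking" repeats concrete, here is a minimal sketch of two common masking conventions: soft masking (repeat bases in lower case) and hard masking (repeat bases replaced by N). The slides do not say which convention VB uses, and the function names, the sequence and the repeat coordinates below are all hypothetical; in practice the intervals would come from tools such as those listed above.

```python
def _apply_mask(sequence, repeats, mask_base):
    """Apply mask_base (a function of one base) inside each repeat interval.

    Intervals are 0-based and end-exclusive; this coordinate convention is an
    assumption made for the sketch, not something stated in the slides.
    """
    bases = list(sequence.upper())
    for start, end in repeats:
        # Clip each interval to the sequence so bad coordinates cannot crash.
        for i in range(max(start, 0), min(end, len(bases))):
            bases[i] = mask_base(bases[i])
    return "".join(bases)


def soft_mask(sequence, repeats):
    """Soft masking: repeat bases become lower case, the sequence is kept."""
    return _apply_mask(sequence, repeats, str.lower)


def hard_mask(sequence, repeats):
    """Hard masking: repeat bases are replaced by 'N'."""
    return _apply_mask(sequence, repeats, lambda base: "N")


if __name__ == "__main__":
    seq = "ACGTACGTACGTACGT"          # hypothetical sequence
    repeats = [(4, 8), (12, 14)]      # hypothetical repeat intervals
    print(soft_mask(seq, repeats))    # ACGTacgtACGTacGT
    print(hard_mask(seq, repeats))    # ACGTNNNNACGTNNGT
```

Gene prediction programs can then skip or down-weight the masked bases, which is how masking helps avoid predicting bad ‘genes’ in repeats.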
Slide 8 of 18
VB automatic annotation: Gene sets
• We make genome-wide sets of gene models using 3 main approaches:
  • aligning various sets of protein sequences to the genome using Genewise
  • aligning ESTs and combining them to make ‘EST genes’
    (both these approaches use the Ensembl system)
  • running an ab initio gene predictor called SNAP (from Ian Korf).
• We then combine these sets, prioritizing the higher confidence genes, and adding in lower confidence ones only where there are gaps to be filled (illustrated on next slide).
• We may also combine a protein-based and an EST-based model to produce a protein-based model with its untranslated 5’ & 3’ regions (UTRs)
• The next slide shows how we combine the sets - but be aware that the details are tailored to suit different species
Slide 9 of 18
VB automatic annotation: gap filling
[Diagram: Set 1 (Targeted), Set 2 (Arthropod), Set 3 (EST-based), Set 4 (Metazoan), Set 5 (Ab initio); gene set being assembled]
The 2 genes from the Targeted set 1 have been placed, and one gene from set 2 can be added into a gap. We will subsequently add single genes from sets 3 & 4, but nothing from set 5.
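The gap-filling step described above can be sketched as a simple priority merge: work through the sets from highest to lowest confidence and place a model only if it does not overlap anything already placed. This is an illustrative sketch, not the actual VB/Ensembl code. The `Model` type, the coordinates and the overlap test are assumptions, and strand, supercontig and exon structure are ignored.

```python
from typing import NamedTuple


class Model(NamedTuple):
    """A gene model reduced to its genomic span (hypothetical structure)."""
    name: str
    start: int   # 1-based, inclusive
    end: int     # inclusive
    source: str  # which set the model came from


def overlaps(a, b):
    """True if the two spans share at least one base."""
    return a.start <= b.end and b.start <= a.end


def merge_by_priority(sets_in_priority_order):
    """Place models set by set; lower-priority models only fill gaps."""
    placed = []
    for gene_set in sets_in_priority_order:
        for model in sorted(gene_set, key=lambda m: m.start):
            if not any(overlaps(model, kept) for kept in placed):
                placed.append(model)
    return sorted(placed, key=lambda m: m.start)


# Hypothetical coordinates chosen to mimic the slide: two Targeted genes,
# one gene added from set 2, one each from sets 3 and 4, none from set 5.
SET1 = [Model("T1", 100, 500, "targeted"), Model("T2", 2000, 2600, "targeted")]
SET2 = [Model("A1", 150, 480, "arthropod"), Model("A2", 700, 1100, "arthropod"),
        Model("A3", 2100, 2500, "arthropod")]
SET3 = [Model("E1", 650, 1000, "est"), Model("E2", 1300, 1700, "est")]
SET4 = [Model("M1", 2900, 3300, "metazoan"), Model("M2", 1350, 1650, "metazoan")]
SET5 = [Model("S1", 120, 400, "ab_initio"), Model("S2", 2950, 3200, "ab_initio")]

if __name__ == "__main__":
    for model in merge_by_priority([SET1, SET2, SET3, SET4, SET5]):
        print(model.name, model.start, model.end, model.source)
```

With these made-up coordinates the merged set is T1, A2, E2, T2, M1. Both ab initio models are dropped because higher-confidence models already occupy their regions.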
Slide 10 of 18
Combining annotation from VB and collaborators
• In most of our projects, the initial gene annotation was produced in collaboration with the J Craig Venter Institute (JCVI) &/or the Broad Institute
• Each of the collaborating institutes generated a gene set
• Approaches included EST-based modeling using PASA, Genewise, ab initio programs such as Augustus etc.
• All sets were then merged into one:
  – No alternative transcripts (a limited number were added later in some species)
  – Genes with compatible structures: keep the longest
  – Overlapping genes with different structures: keep the best-supported
  – Where ab initio model only: eliminate short ones unless similar to known protein or domain
  – Re-screen to eliminate CDS from transposable elements
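One of the merge rules above, dropping short ab initio-only models unless they resemble a known protein or domain, can be written as a small filter. This is a sketch only: the slides give no length cut-off, so `MIN_CDS_LENGTH` is a hypothetical value, and the `AbInitioModel` fields are invented for illustration.

```python
from dataclasses import dataclass

# Hypothetical cut-off in bases; the slides do not state the real threshold.
MIN_CDS_LENGTH = 300


@dataclass
class AbInitioModel:
    """A model supported only by ab initio prediction (illustrative fields)."""
    name: str
    cds_length: int
    similar_to_known_protein: bool = False
    has_known_domain: bool = False


def keep_model(model, min_cds_length=MIN_CDS_LENGTH):
    """Keep long models; keep short ones only if they have similarity support."""
    if model.cds_length >= min_cds_length:
        return True
    return model.similar_to_known_protein or model.has_known_domain


def filter_ab_initio(models, min_cds_length=MIN_CDS_LENGTH):
    """Return the models that survive the rule, in their original order."""
    return [m for m in models if keep_model(m, min_cds_length)]


if __name__ == "__main__":
    candidates = [
        AbInitioModel("long_model", 1200),
        AbInitioModel("short_unsupported", 150),
        AbInitioModel("short_with_domain", 150, has_known_domain=True),
        AbInitioModel("short_with_hit", 210, similar_to_known_protein=True),
    ]
    for model in filter_ab_initio(candidates):
        print(model.name)
```

Only `short_unsupported` is removed here. The other short models survive because they have similarity evidence.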
Slide 11 of 18
Limitations of gene sets
• Gene sets made by automated methods will never be perfect!
• Also dependent on quality of the assembly (see next slide)
• Genes may be missed
  – gaps in assembly; lack of EST or protein-homology evidence in the databases
• Genes may be incomplete
  – gaps in assembly; inability to model less-conserved start & end exons
• Merges & splits
  – adjacent genes may occasionally be merged into one model
  – partial support or gaps can lead to one ‘real’ gene being split into two or more models
Slide 12 of 18
Genome assembly issues
• Whole genome shotgun sequencing projects are often assessed on coverage and on number & average size of contigs and supercontigs
  – the VB projects have quite high coverage (bases of sequence generated >6X number of bases in genome)
  – but many gaps are still present in all our assemblies
• Polymorphism problems
  – VB animals are small, and the DNA for sequencing comes from many individuals which may have significant genetic diversity
  – causes assembly problems including artifactual duplications/deletions and missed regions
• Repeat problems
  – Genomes with high levels of repeated sequences are harder to assemble and, in trying to mask the repeats, gene families can occasionally be masked
• Remember, as well as the assembly, the raw traces (sequence reads) are also available and can be searched.
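The coverage figure quoted above is simply the total number of bases sequenced divided by the number of bases in the genome. A minimal sketch of that arithmetic follows. The genome size and read total are hypothetical round numbers, not figures for any VB species.

```python
def fold_coverage(total_read_bases, genome_size_bases):
    """Fold coverage = bases of sequence generated / bases in the genome."""
    if genome_size_bases <= 0:
        raise ValueError("genome size must be positive")
    return total_read_bases / genome_size_bases


def read_bases_needed(genome_size_bases, target_coverage):
    """Bases of raw sequence needed to reach a target fold coverage."""
    return genome_size_bases * target_coverage


if __name__ == "__main__":
    genome_size = 1_000_000_000   # hypothetical 1 Gb genome
    read_bases = 6_500_000_000    # hypothetical total bases in all reads
    print(f"{fold_coverage(read_bases, genome_size):.1f}X coverage")
    print(f"{read_bases_needed(genome_size, 6):,} bases needed for 6X")
```

Coverage is an average over the whole genome. Individual regions can still be poorly covered or absent, which is why gaps remain even at more than 6X.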
Slide 13 of 18
Gene set comparisons
December 2008

Species     Assembly length   # supercontigs                                 Predicted gene models
Ixodes      1.77 Gb           369,495                                        20,486
Anopheles   280 Mb            8,987 (4,654 ordered on 5 chromosome arms)     12,945
Aedes       1.38 Gb           4,758                                          15,419
Culex       580 Mb            3,171                                          18,883
Can you conclude that Ixodes & Culex really have more genes? No - they might, but the number of predictions depends on the state of the assembly and gene annotation as well. For example, the number of predicted genes for Anopheles decreased in the first revisions as bad predictions were eliminated, and is now set to increase again as a result of detailed manual annotation.
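One way to see how much the state of the assembly varies between species is to derive a couple of rough statistics from the December 2008 figures on this slide. The sketch below assumes the lengths, supercontig counts and gene counts pair with the species in the order listed (Ixodes, Anopheles, Aedes, Culex). The function names are illustrative, and the rounded assembly lengths make the results approximate.

```python
# (assembly length in bases, number of supercontigs, predicted gene models)
# Values are taken from the slide; their pairing with species is an assumption.
ASSEMBLIES = {
    "Ixodes": (1_770_000_000, 369_495, 20_486),
    "Anopheles": (280_000_000, 8_987, 12_945),
    "Aedes": (1_380_000_000, 4_758, 15_419),
    "Culex": (580_000_000, 3_171, 18_883),
}


def mean_supercontig_length(assembly_bp, n_supercontigs):
    """Average supercontig size in bases: a crude measure of fragmentation."""
    return assembly_bp / n_supercontigs


def genes_per_mb(n_genes, assembly_bp):
    """Predicted gene models per megabase of assembly."""
    return n_genes / (assembly_bp / 1_000_000)


if __name__ == "__main__":
    for species, (bp, n_sc, n_genes) in ASSEMBLIES.items():
        print(f"{species:10s} "
              f"mean supercontig {mean_supercontig_length(bp, n_sc):>10,.0f} bp   "
              f"{genes_per_mb(n_genes, bp):5.1f} genes/Mb")
```

On these figures the Ixodes assembly averages under 5 kb per supercontig, far smaller than any of the mosquito assemblies. At that level of fragmentation, gene counts are better read as counts of predictions than as counts of genes.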
Slide 14 of 18
Issues for different species
By now you will be aware that all gene sets, including those at VB, need to be used with a degree of caution. Here are a few additional points for each of the VB species, emphasizing how they differ.
Anopheles gambiae
The only assembly where scaffolds have been assigned to chromosomes; known polymorphism issues partially addressed; gene set now in its fourth version; gene set includes much manual annotation.
Aedes aegypti
Genome much larger & with higher repeat content than the other mosquitoes.
Culex quinquefasciatus
Higher gene count may reflect some real family expansion but may also be some overprediction.
Ixodes scapularis
Large genome; high level of polymorphism leading to assembly with many gaps and large number of separate supercontigs. Gene set expected to be missing genes and to include models that may be incomplete.
Slide 15 of 18
Manual & community input can improve gene sets
• Automatic annotation can be applied to a whole genome relatively rapidly and although it has limitations, these can be taken into account when making use of the gene set.
• Expert manual annotation can improve the structures of individual genes, but is a slow process
  – VB has carried out some systematic manual annotation - mostly on Anopheles so far
  – VB has also done targeted manual annotation leading to correction of some models in all 3 mosquito species
  – Community annotation for individual genes is welcomed and can be submitted via our Community Annotation system - read more in the tutorial here:
http://www.vectorbase.org/Help/VectorBase_tutorials
Slide 16 of 18
Anopheles browser showing manually-annotated models on chromosome arm 2R
The manual annotator suggests merging 2 existing models and changing the structure of another. These changes will be incorporated into build 5 of the Anopheles gene set.
Slide 17 of 18
Adding annotations to gene models
• VB adds value by automatically annotating features of gene models
• Protein features, including:
  – transmembrane regions & signal peptides
  – families (Prints, TIGRFam etc)
  – domains (Pfam etc)
• Cross references to other resources
  – database records that may represent the same gene
  – GO terms
• Community annotations are also welcomed - see the guide here:
http://www.vectorbase.org/Help/Community_Annotation:Submission_User_Guide
Slide 18 of 18
Further information and help
VectorBase help documentation starts at http://www.vectorbase.org/Help/Main_Page
Please email the VectorBase help desk with any further comments or questions. The address is: [email protected]