Ensembl Genome Annotation Overview
Steve Searle

Outline
Overview of Ensembl
Outline of gene annotation process
Details of the gene annotation process on high coverage genomes, including recent modifications to it

Ensembl
Joint Sanger / EBI project (~40 people)
Aim: Annotate vertebrate genomes
Multiple teams:
– Genebuild
– Compara
– Functional Genomics
– Variation
– Core API
– Mart
– Web
Release every 2 months

Species in Ensembl
Anopheles gambiae
Aedes aegypti
Drosophila melanogaster
Dasypus novemcinctus
Loxodonta africana
Echinops telfairi
Tupaia belangeri
Homo sapiens
Pan troglodytes
Macaca mulatta
Otolemur garnettii
Mus musculus
Rattus norvegicus
Spermophilus tridecemlineatus
Cavia porcellus
Oryctolagus cuniculus
Erinaceus europaeus
Myotis lucifugus
Canis familiaris
Felis catus
Bos taurus
Monodelphis domestica
Ornithorhynchus anatinus
Gallus gallus
Xenopus tropicalis
Gasterosteus aculeatus
Oryzias latipes
Takifugu rubripes
Tetraodon nigroviridis
Danio rerio
Ciona intestinalis
Ciona savignyi
Caenorhabditis elegans
Saccharomyces cerevisiae

Types of annotation in Ensembl
• Raw computes
• Gene annotation
  – Protein coding
  – Pseudogenes
  – ncRNAs
• Xrefs
• Protein annotation
• Comparative genomic data
  – Genomic alignments
  – Gene homologs
• Variation data
• Functional genomic data
  – Probe mapping
  – ChIP-seq
  – Regulatory region build

Outline
Overview of Ensembl
Outline of gene annotation process
Details of the gene annotation process on high coverage genomes, including recent modifications to it

Ensembl Genebuilding
Ensembl generates automatic annotation on over 30 genomes.
Annotation strategies:
• High coverage: alignments of protein and cDNA sequences
• Low coverage: project gene structures from reference species through genomic alignment
• High coverage primate: alignment of species specific sequences + gene projection
• Fly, yeast, worm: import manual annotation (worm now done by WormBase team)

Data used for building on high coverage genomes
Build CDS models using:
– Uniprot/Swissprot
– Uniprot/TrEMBL
– Refseq NPs
cDNAs:
– EMBL/GenBank
Now also use:
– IMGT data
Other transcript models incorporated:
– CCDS models
– Subset of Havana manual annotation

Outline of Genebuild Process
[Flow diagram. Recoverable labels:
 Inputs: species specific proteins; other proteins; species specific cDNAs; species specific ESTs
 Steps: Pmatch; Genewise / Exonerate; Exonerate (two boxes); UTR Addition; Genebuilder; ClusterMerge
 Intermediate products: CDS models; transcript models (no CDS); CDS models with UTRs; aligned ESTs
 Outputs: core Ensembl genes; pseudogenes; final gene set; Ensembl EST genes]
Targetted Genewise overview
• Species specific protein sequences: Uniprot/Swissprot, Uniprot/TrEMBL, Refseq (NPs)
• pmatch vs assembly
• Blast, cluster hits to define region to build model
• genewise
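The "cluster hits to define region" step can be illustrated with a small interval-merging sketch. This is not the Ensembl implementation: the function names, the `max_gap` parameter and the `padding` value are assumptions made for illustration only.

```python
from typing import List, Tuple

Hit = Tuple[int, int]  # (start, end), 1-based inclusive genomic coordinates


def cluster_hits(hits: List[Hit], max_gap: int = 0) -> List[Hit]:
    """Merge hits that overlap or lie within max_gap bases of each other.

    max_gap is a hypothetical tuning parameter, not taken from the slides.
    """
    regions: List[Hit] = []
    for start, end in sorted(hits):
        if regions and start <= regions[-1][1] + max_gap:
            # Extend the current region to cover this hit.
            prev_start, prev_end = regions[-1]
            regions[-1] = (prev_start, max(prev_end, end))
        else:
            regions.append((start, end))
    return regions


def pad_region(region: Hit, padding: int, seq_length: int) -> Hit:
    """Grow a region on both sides, clamped to the sequence boundaries."""
    start, end = region
    return (max(1, start - padding), min(seq_length, end + padding))
```

Each merged, padded region would then be the slice of genome handed to the gene model builder.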

Similarity Genewise overview
• Blast vs repeatmasked Slice, not just genscan exons (coverage threshold)
• Fetch protein blast hits (score threshold)
• Discard hits overlapping targetted genewises
• reblast and optionally miniseq
• genewise
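The two filtering steps (score threshold, then discarding hits that overlap targetted models) could look roughly like the sketch below. The `BlastHit` record, the `min_score` value and the representation of targetted models as plain coordinate pairs are all assumptions; the slides do not give the actual thresholds or data structures.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class BlastHit:
    """Hypothetical minimal record for a protein blast hit on the genome."""
    protein_id: str
    start: int
    end: int
    score: float


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if two inclusive intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end


def filter_hits(
    hits: List[BlastHit],
    targetted_regions: List[Tuple[int, int]],
    min_score: float,
) -> List[BlastHit]:
    """Keep hits at or above min_score that do not overlap a targetted model."""
    kept = []
    for hit in hits:
        if hit.score < min_score:
            continue  # below the (assumed) score threshold
        if any(overlaps(hit.start, hit.end, s, e) for s, e in targetted_regions):
            continue  # region already covered by a targetted genewise model
        kept.append(hit)
    return kept
```

Only the surviving hits would go on to the reblast and genewise steps.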

Adding UTRs
• Genewise – phases, no UTRs
• Exonerate – UTRs, no phases
• Combined result: translatable gene with UTRs
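One way to picture the combination is to take the coding boundaries from the protein-based model and the exon structure from the cDNA alignment, accepting the pair only when their splicing agrees across the coding region. The sketch below is a simplified illustration under that assumption (forward strand only, exons sorted by start); the compatibility rule and the `add_utrs` name are hypothetical, not the rule used by the Ensembl UTR addition code.

```python
from typing import List, Optional, Tuple

Exon = Tuple[int, int]  # (start, end), inclusive, forward strand, sorted by start


def introns(exons: List[Exon]) -> List[Tuple[int, int]]:
    """Return (upstream_exon_end, downstream_exon_start) for consecutive exons."""
    return [(a[1], b[0]) for a, b in zip(exons, exons[1:])]


def add_utrs(
    cds_exons: List[Exon], cdna_exons: List[Exon]
) -> Optional[Tuple[List[Exon], int, int]]:
    """Combine a CDS-only model with a cDNA alignment.

    Returns (exons, coding_start, coding_end), or None if the two models
    are not compatible under the assumed rule.
    """
    cds_start, cds_end = cds_exons[0][0], cds_exons[-1][1]

    def inside_cdna(pos: int) -> bool:
        return any(s <= pos <= e for s, e in cdna_exons)

    # Both ends of the coding region must fall inside cDNA exons.
    if not (inside_cdna(cds_start) and inside_cdna(cds_end)):
        return None

    # Within the coding region the cDNA must use exactly the CDS introns.
    cdna_introns_in_cds = [
        i for i in introns(cdna_exons) if cds_start < i[0] and i[1] < cds_end
    ]
    if cdna_introns_in_cds != introns(cds_exons):
        return None

    # Exon structure comes from the cDNA; coding boundaries from the CDS model.
    return list(cdna_exons), cds_start, cds_end
```

Everything upstream of `coding_start` and downstream of `coding_end` in the returned exons is then UTR.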

Outline
Overview of Ensembl
Outline of gene annotation process
Details of the gene annotation process on high coverage genomes, including recent modifications to it

New human and mouse gene builds
Modifications to build method
Improving predicted gene structures:
• Improved alignments of species specific data
• Filtering out problematic sequences
  – chimeric cDNAs
  – retained introns
  – incorrect TrEMBL CDS entries
• UTR addition modified
• Immunoglobulin build
• Updated manual annotation set incorporated
• Homology data used to improve incomplete models
Reducing overprediction:
• Improved pseudogene detection
• Filtering out problematic sequences
  – viral proteins
  – repeats
Reducing underprediction:
• Homology data used to identify missed orthologs
• Updated manual annotation set incorporated

QC of current and previous human and mouse Ensembl gene sets
[Bar chart. Y axis: 0% to 100%. Categories: missing, matching, perfect edges, perfect identity.
 Human bars: NCBI36 (47) SP, NCBI36 (44) SP, NCBI36 (47) RS P, NCBI36 (44) RS P, NCBI36 (47) RS R, NCBI36 (44) RS R
 Mouse bars: NCBIM37 (47) SP, NCBIM36 (44) SP, NCBIM37 (47) RS P, NCBIM36 (44) RS P, NCBIM37 (47) RS R, NCBIM36 (44) RS R]

[Flow diagram of the genebuild process. Recoverable labels:
 Inputs: species specific proteins; other proteins; species specific cDNAs; species specific ESTs; ditag alignments; CCDS transcripts; Havana manual annotation; IMGT Ig genes
 Steps: Pmatch/Exonerate; Genewise; Exonerate; BestTargetted; Exonerate / cDNAUpdate; KillList Filter (three boxes); TranscriptConsensus; UTR Addition; GeneBuilder; TranscriptCoalescer; comparative analysis; HavanaAdder; SetMerger
 Intermediate products: CDS models; transcript models; CDS models with UTRs; core Ensembl genes with UTR; preliminary gene set; fixed partial and missed orthologs; aligned ESTs
 Outputs: pseudogenes; Ensembl EST genes; final gene set]

Species specific protein alignment
Aim: Improved alignment of species specific proteins
Problems:
– Non GT-AG spliced introns were not always correctly predicted
– Short terminal exons were sometimes missed
Solution:
– Produce multiple models for each input protein
  • Pmatch with shorter word length
  • Genewise with different parameters
  • Exonerate protein2genome model
– Choose best model (most similar to original protein)
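The "choose best model" step can be sketched as picking, among the translations produced by the different methods, the one closest to the input protein. The slides do not say how similarity is measured; the sketch below uses Python's `difflib.SequenceMatcher` ratio purely as a stand-in, and the function names and toy sequences are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, Tuple


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; a stand-in for a real alignment-based score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def choose_best_model(protein_seq: str, candidates: Dict[str, str]) -> Tuple[str, float]:
    """Return (method_name, similarity) of the candidate translation
    that is most similar to the original protein.

    candidates maps a method name to the translated sequence of its model.
    """
    scored = {name: similarity(protein_seq, seq) for name, seq in candidates.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]


# Toy example: one model misses a short terminal exon, the other does not.
protein = "MKTAYIAKQRQISFVKSHFSRQ"
models = {
    "genewise": "MKTAYIAKQRQISFVK",
    "exonerate_protein2genome": "MKTAYIAKQRQISFVKSHFSRQ",
}
best_method, best_score = choose_best_model(protein, models)
```

Generating several candidate models per protein and keeping only the closest one trades extra compute for fewer truncated or mis-spliced models.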

Species specific protein alignment
Method                            NCBI NPs matched
Genewise                          64%
+ Non-consensus splice sites      74.2%
+ Pmatch 14                       81.55%
+ Exonerate, BestTargetted        85.67%

Exonerate cdna2genome model
Combined model aligning cDNA and CDS simultaneously should improve transcript models
Method                                                              Gene      Transcript
TargettedGenewise                                                   77.88%    74.19%
BestTargetted with Genewise and Exonerate                           88.26%    85.67%
Exonerate cdna2genome (coding exons only)                           86.65%    82.21%
Exonerate cdna2genome (all coding exons except for the start of
  the first exon and end of the last exon)                          86.22%    81.51%
BestTargetted with Genewise, Exonerate and Exonerate cdna2genome    95.18%    93.78%

Input data filtering
Prevent alignments of ‘bad’ cDNAs and proteins to the genome
KillList database replaced kill list file:
– More flexible than a text file
– Kill List as it was at any date
– Species and analysis specific kills
– Versioning
– API access
Number of sequences in database, by reason for being killed:
  3148  Retained intron
   919  Chimeric clone
   657  Chimeric cDNA
   384  Viral
   312  Env (retroviral)
   250  Partial sequence
   230  Hits many times
   219  Pol (retroviral)
   150  LINE
   115  Gag (retroviral)
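To make the advantages of a database over a flat file concrete, here is a minimal sketch of a kill list that supports species-specific and analysis-specific entries and can be queried as of any date. The table layout, column names and helper functions are invented for this example and are not the real KillList schema or API; the accessions are placeholders.

```python
import sqlite3
from typing import Optional, Set


def create_kill_list(conn: sqlite3.Connection) -> None:
    """Create a hypothetical kill list table (not the real KillList schema)."""
    conn.execute(
        """
        CREATE TABLE kill_entry (
            accession TEXT NOT NULL,
            reason    TEXT NOT NULL,
            species   TEXT,           -- NULL means the kill applies to all species
            analysis  TEXT,           -- NULL means the kill applies to all analyses
            added     TEXT NOT NULL,  -- ISO date the entry was added
            removed   TEXT            -- ISO date it was retired, NULL if still active
        )
        """
    )


def add_kill(
    conn: sqlite3.Connection,
    accession: str,
    reason: str,
    added: str,
    species: Optional[str] = None,
    analysis: Optional[str] = None,
) -> None:
    conn.execute(
        "INSERT INTO kill_entry VALUES (?, ?, ?, ?, ?, NULL)",
        (accession, reason, species, analysis, added),
    )


def killed_accessions(
    conn: sqlite3.Connection, species: str, analysis: str, on_date: str
) -> Set[str]:
    """Accessions killed for this species and analysis as of on_date."""
    rows = conn.execute(
        """
        SELECT DISTINCT accession FROM kill_entry
        WHERE (species IS NULL OR species = ?)
          AND (analysis IS NULL OR analysis = ?)
          AND added <= ?
          AND (removed IS NULL OR removed > ?)
        """,
        (species, analysis, on_date, on_date),
    )
    return {row[0] for row in rows}
```

Keeping `added` and `removed` dates on each row is one simple way to recover the list "as it was at any date"; a full versioning scheme would need more than this.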

UTR addition
UTR addition module rewritten to incorporate previously separate stages:
– LookForBoth: expanding out incomplete CDSs into added UTR
– KnownUTR: makes known cDNA protein pairings
and greater flexibility:
– Ditag and EST data can be used to score cDNAs
Other changes:
• cDNA update pipeline for cDNA alignments in human and mouse rather than single exonerate run
  – 3 rounds of exonerate with increasingly sensitive parameters
  – Aligns 87% of 231145 cDNAs in human
• Chimeric cDNA and retained intron cDNAs kill listed
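The multi-round idea in the cDNA update pipeline (align with strict settings first, then retry only the failures with more sensitive settings) can be sketched as a simple loop. The aligner is passed in as a function, so nothing here depends on real exonerate options; `align_in_rounds`, the parameter dictionaries and the toy aligner are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple

# An aligner takes the ids still to be aligned plus one parameter set and
# returns the ids it managed to align. It stands in for a real exonerate run.
Aligner = Callable[[List[str], dict], Set[str]]


def align_in_rounds(
    cdna_ids: Sequence[str], rounds: Sequence[dict], align: Aligner
) -> Tuple[Dict[str, int], List[str]]:
    """Run successive rounds, passing only unaligned cDNAs to the next round.

    Returns (aligned, remaining): aligned maps each cDNA id to the round
    number (1-based) in which it aligned; remaining lists ids never aligned.
    """
    aligned: Dict[str, int] = {}
    remaining = list(cdna_ids)
    for round_no, params in enumerate(rounds, start=1):
        hits = align(remaining, params)
        for cdna_id in remaining:
            if cdna_id in hits:
                aligned[cdna_id] = round_no
        remaining = [c for c in remaining if c not in hits]
        if not remaining:
            break
    return aligned, remaining


# Toy aligner: pretend each cDNA has a best achievable percent identity.
toy_identity = {"c1": 99, "c2": 95, "c3": 85, "c4": 50}


def toy_align(ids: List[str], params: dict) -> Set[str]:
    return {i for i in ids if toy_identity[i] >= params["min_identity"]}


toy_rounds = [{"min_identity": 97}, {"min_identity": 90}, {"min_identity": 80}]
toy_aligned, toy_remaining = align_in_rounds(list(toy_identity), toy_rounds, toy_align)
```

Running the strictest settings first means most sequences get their best alignment cheaply, and the slower, more sensitive settings are spent only on the hard cases.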

Pseudogenes and artefacts
Using protein domains to identify Ensembl families of viral origin
– In first run more than 3000 genes have been removed across all species
– Evidence has been kill listed
– Patch applied to release 49 removing 1200 more (and 400 Alus)
Better identification of processed pseudogenes
– New biotype ‘retrotransposed’

CCDS
Set of CDS structures identical between Ensembl / Havana and NCBI:
– assessed by UCSC
– guaranteed to be retained in both sets
– removal requires agreement from Hinxton, NCBI and UCSC
Extended to mouse in 2007.
Current set sizes:
– 16004 human genes
– 16892 mouse genes (up from 12981 in previous set)
New round of comparisons on human in progress
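The core idea of "CDS structures identical between two annotation sets" amounts to a set intersection over a canonical description of each CDS. The sketch below assumes a CDS can be keyed by chromosome, strand and its sorted coding exon coordinates; this key and the function names are illustrative only and say nothing about how the real CCDS comparison or its QC is carried out.

```python
from typing import Iterable, List, Set, Tuple

CdsKey = Tuple[str, int, Tuple[Tuple[int, int], ...]]


def cds_key(chrom: str, strand: int, cds_exons: List[Tuple[int, int]]) -> CdsKey:
    """Canonical, hashable description of one CDS structure."""
    return (chrom, strand, tuple(sorted(cds_exons)))


def shared_cds(set_a: Iterable[CdsKey], set_b: Iterable[CdsKey]) -> Set[CdsKey]:
    """CDS structures that are exactly identical in both annotation sets."""
    return set(set_a) & set(set_b)
```

Because the key includes every coding exon boundary, two models that differ by a single base at one splice site would not count as shared.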

Havana annotation merging
• Incorporate Havana transcripts with complete CDS
• Redundant transcripts in merged set removed
• Merging process pipelined for new human and mouse builds and updated Vega databases used
Species   Ensembl   Ensembl/Havana   Havana
Human     12942     9419             379
Mouse     14142     8802             105

Immunoglobulin build
Aim: Improve annotation of Immunoglobulin gene segment clusters
Problem: Ig gene segments cause problems for standard build procedure
– cDNAs and proteins do not represent the germline genomic arrangement
Solution: Incorporate IMGT data
– Align IMGT Ig segment annotation to genome
– Replace overlapping genes built by standard build procedure
Currently only used in human and mouse

Clustering rules
False gene merging
Transcript models are clustered into genes based on exon overlap
Old rule: Any exons overlap
New rule: Only overlap of CDS exons

Status
Human and mouse gene builds were the main focus for 2007
• They have been the driver for a significant amount of development work
• Resulting builds are the best we have created so far
• We’ve produced patches to further improve the current sets
Problems which remain
• Sequence data which causes problems for prediction
  – ‘Chimeric’ clones
  – Fragmentary protein and cDNA sequences
  – Translated 3' UTR
• Gene clusters
  – CDS overlap as definition of gene leads to some cases that don't agree with manual definition of gene

Acknowledgements
Genebuilders: Val Curwen, Bronwen Aken, Julio Fernandez Banet, Laura Clarke, Amonida Zadissa, Felix Kokocinski, Jan-Hinnerk Vogel, Simon White
Sarah Dyer
Tim Hubbard, Ewan Birney
Michael Schuster
The rest of Ensembl
ISG (Systems support): Guy Coates
Havana: Jen Harrow, Laurens Wilming
NCBI: Kim Pruitt
UCSC: Mark Diekhans