Ensembl Genome Annotation Overview Steve Searle


Post on 30-Jan-2016






Page 1: Ensembl Genome Annotation Overview

Ensembl Genome Annotation Overview

Steve Searle

Page 2: Ensembl Genome Annotation Overview


Overview of Ensembl

Outline of gene annotation process

Details of the gene annotation process on high coverage genomes including recent modifications to it

Page 3: Ensembl Genome Annotation Overview

Ensembl

Joint Sanger / EBI project (~40 people)

Aim: Annotate Vertebrate genomes

Multiple teams



Functional Genomics


Core API



Release every 2 months

Page 4: Ensembl Genome Annotation Overview

Species in Ensembl

Anopheles gambiae
Aedes aegypti
Drosophila melanogaster
Dasypus novemcinctus
Loxodonta africana
Echinops telfairi
Tupaia belangeri
Homo sapiens
Pan troglodytes
Macaca mulatta
Otolemur garnettii
Mus musculus
Rattus norvegicus
Spermophilus tridecemlineatus
Cavia porcellus
Oryctolagus cuniculus
Erinaceus europaeus
Myotis lucifugus
Canis familiaris
Felis catus
Bos taurus
Monodelphis domestica
Ornithorhynchus anatinus
Gallus gallus
Xenopus tropicalis
Gasterosteus aculeatus
Oryzias latipes
Takifugu rubripes
Tetraodon nigroviridis
Danio rerio
Ciona intestinalis
Ciona savignyi
Caenorhabditis elegans
Saccharomyces cerevisiae

Page 5: Ensembl Genome Annotation Overview

Types of annotation in EnsEMBL

• Raw computes
• Gene annotation
  – Protein coding
  – Pseudogenes
  – ncRNAs
• Xrefs
• Protein annotation
• Comparative genomic data
  – Genomic alignments
  – Gene homologs
• Variation data
• Functional genomic data
  – Probe mapping
  – ChIP-seq
  – Regulatory region build

Page 6: Ensembl Genome Annotation Overview


Overview of Ensembl

Outline of gene annotation process

Details of the gene annotation process on high coverage genomes including recent modifications to it

Page 7: Ensembl Genome Annotation Overview

Ensembl Genebuilding

Ensembl generates automatic annotation on over 30 genomes.

Annotation strategies:

• High coverage: alignments of protein and cDNA sequences
• Low coverage: project gene structures from reference species through genomic alignment
• High coverage primate: alignment of species specific sequences + gene projection
• Fly, yeast, worm: import manual annotation (worm now done by WormBase team)

Page 8: Ensembl Genome Annotation Overview

Data used for building on high coverage genomes

Build CDS models using:
• Uniprot/Swissprot
• Uniprot/TrEMBL
• Refseq NPs

Now also use:
• IMGT data

Other transcript models incorporated:
• CCDS models
• Subset of Havana manual annotation

Page 9: Ensembl Genome Annotation Overview

Outline of Genebuild Process

[Flow diagram. Labels: Species specific proteins; Other proteins; Genewise / Exonerate; Species specific cDNAs; Transcript models (no CDS); UTR addition; CDS models with UTRs; Core EnsEMBL genes; Pseudogenes; Final gene set; Species specific ESTs; Aligned ESTs; EnsEMBL EST genes]

Page 10: Ensembl Genome Annotation Overview

Targetted Genewise overview

Blast, cluster hits to define region to build model

Species specific protein sequences: Uniprot/Swissprot, Uniprot/TrEMBL, Refseq (NPs)

pmatch vs assembly
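The "cluster hits to define region" step on this slide can be illustrated with a minimal sketch. It is not the Ensembl pipeline code: the hit representation, the `max_gap` and `padding` parameters, and the function names are all assumptions made for the example.

```python
from typing import List, Tuple

Hit = Tuple[int, int]  # (start, end) on the genomic sequence, 1-based inclusive


def cluster_hits(hits: List[Hit], max_gap: int) -> List[Hit]:
    """Merge hits that lie within max_gap bases of each other into regions."""
    regions: List[Hit] = []
    for start, end in sorted(hits):
        if regions and start - regions[-1][1] <= max_gap:
            # Close enough to the previous region: extend it.
            prev_start, prev_end = regions[-1]
            regions[-1] = (prev_start, max(prev_end, end))
        else:
            regions.append((start, end))
    return regions


def pad_region(region: Hit, padding: int, seq_length: int) -> Hit:
    """Extend a region by `padding` bases each side, clipped to the sequence."""
    start, end = region
    return (max(1, start - padding), min(seq_length, end + padding))


# Hypothetical hits for one protein: two nearby hits and one distant hit.
hits = [(100, 200), (250, 300), (5000, 5100)]
regions = cluster_hits(hits, max_gap=100)
print(regions)
print(pad_region(regions[0], padding=200, seq_length=10000))
```

Each padded region would then be the slice of genome handed to the model-building step.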


Page 11: Ensembl Genome Annotation Overview

Similarity Genewise overview

reblast and optionally miniseq


Blast vs repeatmasked Slice, not just genscan Exons (coverage threshold)

Fetch protein blast hits Score threshold

Discard hits overlapping targetted genewises

Page 12: Ensembl Genome Annotation Overview

Adding UTRs

Translatable gene with UTRs

Exonerate – UTRs, no phases

Genewise – phases, no UTRs

Page 13: Ensembl Genome Annotation Overview


Overview of Ensembl

Outline of gene annotation process

Details of the gene annotation process on high coverage genomes including recent modifications to it

Page 14: Ensembl Genome Annotation Overview

New human and mouse gene builds

Modifications to build method

Improving predicted gene structures
• Improved alignments of species specific data
• Filtering out problematic sequences
  – chimeric cDNAs
  – retained introns
  – incorrect TrEMBL CDS entries
• UTR addition modified
• Immunoglobulin build
• Updated manual annotation set incorporated
• Homology data used to improve incomplete models

Reducing overprediction
• Improved pseudogene detection
• Filtering out problematic sequences
  – viral proteins
  – repeats

Reducing underprediction
• Homology data used to identify missed orthologs
• Updated manual annotation set incorporated

Page 15: Ensembl Genome Annotation Overview

QC of current and previous human and mouse Ensembl gene sets

[Chart. Series: NCBI36 (47) SP; NCBI36 (44) SP; NCBI36 (47) RS P; NCBI36 (44) RS P; NCBI36 (47) RS R; NCBI36 (44) RS R; NCBIM37 (47) SP; NCBIM36 (44) SP; NCBIM37 (47) RS P; NCBIM36 (44) RS P; NCBIM37 (47) RS R; NCBIM36 (44) RS R. Categories: missing; matching; perfect edges; perfect identity]


Page 16: Ensembl Genome Annotation Overview

[Flow diagram of the revised genebuild. Labels: Species specific proteins; Other proteins; Genewise; Exonerate; CDS models; Species specific cDNAs; Exonerate / cDNA update; Transcript models; UTR addition; CDS models with UTRs; Core Ensembl genes with UTR; Fixed partial and missed orthologs; Ditag alignments; IMGT Ig genes; CCDS transcripts; Havana manual annotation; KillList filter (three instances); Preliminary gene set; Final gene set; Species specific ESTs; Aligned ESTs; EnsEMBL EST genes]




Page 17: Ensembl Genome Annotation Overview

Species specific protein alignment

Aim: Improved alignment of species specific proteins

Problems:
– Non GT-AG spliced introns were not always correctly predicted
– Short terminal exons were sometimes missed

Solution:
– Produce multiple models for each input protein
  • Pmatch with shorter word length
  • Genewise with different parameters
  • Exonerate protein2genome model
– Choose best model (most similar to original protein)
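The "choose best model" step can be sketched roughly as below. This is an illustration only: the slide does not say how similarity is scored, so the sketch substitutes Python's `difflib.SequenceMatcher` ratio for a proper protein alignment score, and the sequences and function names are invented for the example.

```python
from difflib import SequenceMatcher
from typing import Dict


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two protein sequences.

    Stand-in scoring only; a real build would use an alignment-based measure.
    """
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def choose_best_model(input_protein: str, candidates: Dict[str, str]) -> str:
    """Return the method name whose model translation best matches the input.

    `candidates` maps a method name (e.g. "genewise") to the translated
    protein sequence of the model that method produced.
    """
    if not candidates:
        raise ValueError("no candidate models")
    return max(candidates, key=lambda name: similarity(input_protein, candidates[name]))


# Hypothetical input protein and two candidate model translations.
protein = "MKTAYIAKQRQISFVKSHFSRQ"
models = {
    "genewise": "MKTAYIAKQRQISFVK",         # short terminal exon missed
    "exonerate": "MKTAYIAKQRQISFVKSHFSRQ",  # full-length model
}
print(choose_best_model(protein, models))
```

Running several aligners and keeping the closest translation means a weakness of one method, such as a missed terminal exon, can be covered by another.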

Page 18: Ensembl Genome Annotation Overview

Species specific protein alignment

Method                           NCBI NPs matched
Genewise                         64%
+ Non-consensus splice sites     74.2%
+ Pmatch 14                      81.55%
+ Exonerate, BestTargetted       85.67%

Page 19: Ensembl Genome Annotation Overview

Exonerate cdna2genome model

Method                                                      Gene     Transcript
TargettedGenewise                                           77.88%   74.19%
BestTargetted with Genewise and Exonerate                   88.26%   85.67%
Exonerate cdna2genome (coding exons only)                   86.65%   82.21%
Exonerate cdna2genome (all coding exons except for the
  start of the first exon and end of the last exon)         86.22%   81.51%
BestTargetted with Genewise, Exonerate and
  Exonerate cdna2genome                                     95.18%   93.78%

Combined model aligning cDNA and CDS simultaneously should improve transcript models

Page 20: Ensembl Genome Annotation Overview

Input data filtering

Prevent alignments of ‘bad’ cDNAs and proteins to the genome

KillList database replaced kill list file
– More flexible than a text file
– Kill List as it was at any date
– Species and analysis specific kills
– Versioning
– API access

Number of sequences in database   Reason for being killed
3148                              Retained intron
919                               Chimeric clone
657                               Chimeric cDNA
384                               Viral
312                               Env (retroviral)
250                               Partial sequence
230                               Hits many times
219                               Pol (retroviral)
150                               LINE
115                               Gag (retroviral)
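The kill-list properties listed on this slide (state as of any date, species and analysis specific kills) can be illustrated with a small sketch. The record layout, field names, accessions and analysis names below are hypothetical and do not reflect the actual KillList database schema or API.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class KillEntry:
    accession: str
    reason: str
    added: date
    removed: Optional[date] = None   # None: still on the kill list
    species: Optional[str] = None    # None: applies to all species
    analysis: Optional[str] = None   # None: applies to all analyses


def killed_accessions(entries: Iterable[KillEntry], on: date,
                      species: str, analysis: str) -> Set[str]:
    """Return accessions killed on a given date for one species and analysis."""
    killed: Set[str] = set()
    for e in entries:
        active = e.added <= on and (e.removed is None or on < e.removed)
        applies = e.species in (None, species) and e.analysis in (None, analysis)
        if active and applies:
            killed.add(e.accession)
    return killed


def filter_inputs(accessions: Iterable[str], killed: Set[str]) -> List[str]:
    """Drop killed sequences before they are aligned to the genome."""
    return [a for a in accessions if a not in killed]


# Hypothetical entries; accessions are placeholders, not real identifiers.
entries = [
    KillEntry("cdna_A", "Retained intron", date(2007, 1, 10)),
    KillEntry("cdna_B", "Chimeric cDNA", date(2007, 6, 1), species="human"),
    KillEntry("prot_C", "Viral", date(2006, 3, 1), removed=date(2007, 2, 1)),
]
killed = killed_accessions(entries, date(2007, 7, 1), "mouse", "cdna_alignment")
print(filter_inputs(["cdna_A", "cdna_B", "prot_C"], killed))
```

Keeping added and removed dates on each entry is one simple way to reconstruct the list "as it was at any date", which a flat text file cannot do.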

Page 21: Ensembl Genome Annotation Overview

‘Chimeric’ cDNA example

Page 22: Ensembl Genome Annotation Overview

UTR addition

UTR addition module rewritten to incorporate previously separate stages:

LookForBoth
• Expanding out incomplete CDSs into added UTR

KnownUTR
• Makes known cDNA protein pairings

and greater flexibility:
• Ditag and EST data can be used to score cDNAs

Other changes:
• cDNA update pipeline for cDNA alignments in human and mouse rather than single exonerate run
  – 3 rounds of exonerate with increasingly sensitive parameters
  – Aligns 87% of 231145 cDNAs in human
• Chimeric cDNA and retained intron cDNAs kill listed
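The idea of several alignment rounds with increasingly sensitive parameters can be sketched generically. The code below does not call exonerate; the aligners are stub functions, and all names are assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# An aligner takes a sequence id and returns an alignment, or None on failure.
Aligner = Callable[[str], Optional[str]]


def align_in_rounds(seq_ids: Iterable[str],
                    rounds: List[Aligner]) -> Tuple[Dict[str, str], List[str]]:
    """Try each aligner in turn, passing on only the sequences still unaligned.

    Returns (alignments by sequence id, ids that no round could align).
    """
    aligned: Dict[str, str] = {}
    remaining = list(seq_ids)
    for aligner in rounds:
        still_unaligned: List[str] = []
        for seq_id in remaining:
            result = aligner(seq_id)
            if result is None:
                still_unaligned.append(seq_id)
            else:
                aligned[seq_id] = result
        remaining = still_unaligned
    return aligned, remaining


def make_stub(name: str, alignable: set) -> Aligner:
    """Build a stand-in aligner that succeeds only for ids in `alignable`."""
    return lambda seq_id: f"{name}:{seq_id}" if seq_id in alignable else None


# Three stub rounds; each later round can align more of the inputs.
rounds = [
    make_stub("strict", {"c1"}),
    make_stub("medium", {"c1", "c2"}),
    make_stub("sensitive", {"c1", "c2", "c3"}),
]
aligned, failed = align_in_rounds(["c1", "c2", "c3", "c4"], rounds)
print(aligned, failed)
```

Running the cheapest settings first means the more sensitive, slower settings are only spent on the sequences that need them.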

Page 23: Ensembl Genome Annotation Overview

Pseudogenes and artefacts

Using protein domains to identify Ensembl families of viral origin
– In first run more than 3000 genes have been removed across all species
– Evidence has been kill listed
– Patch applied to release 49 removing 1200 more (and 400 Alus)

Better identification of processed pseudogenes
– new biotype ‘retrotransposed’

Page 24: Ensembl Genome Annotation Overview

CCDS

Set of CDS structures identical between Ensembl / Havana and NCBI:

– assessed by UCSC

– guaranteed to be retained in both sets

– removal requires agreement from Hinxton, NCBI and UCSC

Extended to mouse in 2007.

Current set sizes:

16004 human genes

16892 mouse genes (up from 12981 in previous set)

New round of comparisons on human in progress

Page 25: Ensembl Genome Annotation Overview

Havana annotation merging

Incorporate Havana transcripts with complete CDS

Redundant transcripts in merged set removed

Merging process pipelined for new human and mouse builds and updated Vega dbs used.

Species   Ensembl   Ensembl/Havana   Havana
Human     12942     9419             379
Mouse     14142     8802             105
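One way to picture the removal of redundant transcripts in the merged set is sketched below. The slide does not define redundancy, so the sketch assumes two transcripts are redundant when they share strand and exon coordinates exactly; the data structure and names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Transcript:
    name: str
    source: str                        # "ensembl" or "havana"
    strand: int
    exons: Tuple[Tuple[int, int], ...]  # (start, end) per exon


def merge_sets(ensembl: Iterable[Transcript],
               havana: Iterable[Transcript]) -> List[Tuple[str, str]]:
    """Merge two transcript sets, collapsing identical exon structures.

    Returns (kept transcript name, source label) pairs. Havana transcripts
    are processed first so the manual annotation is the copy that is kept.
    """
    by_structure: Dict[tuple, List[Transcript]] = {}
    for t in list(havana) + list(ensembl):
        by_structure.setdefault((t.strand, t.exons), []).append(t)

    merged: List[Tuple[str, str]] = []
    for group in by_structure.values():
        sources = sorted({t.source for t in group})
        merged.append((group[0].name, "/".join(sources)))
    return merged


# Hypothetical transcripts: E1 duplicates H1, E2 is unique to Ensembl.
havana = [Transcript("H1", "havana", 1, ((100, 200), (300, 400)))]
ensembl = [
    Transcript("E1", "ensembl", 1, ((100, 200), (300, 400))),
    Transcript("E2", "ensembl", 1, ((100, 200), (350, 400))),
]
print(merge_sets(ensembl, havana))
```

The source labels produced here mirror the Ensembl, Ensembl/Havana and Havana columns of the table above.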

Page 27: Ensembl Genome Annotation Overview

Immunoglobulin build

Aim: Improve annotation of Immunoglobulin gene segment clusters

Problem: Ig gene segments cause problems for standard build procedure

cDNAs and proteins do not represent the germline genomic arrangement

Solution: Incorporate IMGT data
– Align IMGT Ig segment annotation to genome
– Replace overlapping genes built by standard build procedure

Currently only used in human and mouse
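The replacement step could look something like the following sketch. It assumes genes are simple (name, start, end, strand) tuples on a single sequence region and that "overlapping" means any same-strand coordinate overlap; neither assumption comes from the slides, and the gene names are placeholders.

```python
from typing import List, Tuple

Gene = Tuple[str, int, int, int]  # (name, start, end, strand)


def overlaps(a: Gene, b: Gene) -> bool:
    """True if two genes are on the same strand and share at least one base."""
    return a[3] == b[3] and a[1] <= b[2] and b[1] <= a[2]


def replace_with_ig(standard_genes: List[Gene], ig_segments: List[Gene]) -> List[Gene]:
    """Drop standard-build genes that overlap any Ig segment, then add the segments."""
    kept = [g for g in standard_genes
            if not any(overlaps(g, ig) for ig in ig_segments)]
    return sorted(kept + list(ig_segments), key=lambda g: (g[1], g[2]))


# Hypothetical locus: one standard gene spans two Ig segments, one is unrelated.
standard = [("std1", 100, 500, 1), ("std2", 1000, 2000, 1)]
ig = [("ig_seg_a", 150, 200, 1), ("ig_seg_b", 300, 350, 1)]
print(replace_with_ig(standard, ig))
```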

Page 28: Ensembl Genome Annotation Overview

Ig Build

Rel 38

Rel 47

Page 29: Ensembl Genome Annotation Overview

Clustering rules

False gene merging

Transcript models are clustered into genes based on exon overlap

Old rule: Any exons overlap

New rule: Only overlap of CDS exons
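The difference between the two rules can be shown with a small single-linkage clustering sketch. It is a simplification: strand is ignored, the CDS is given as one genomic range per transcript, and the class and function names are invented for the example.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]


class Transcript:
    def __init__(self, name: str, exons: List[Interval], cds: Interval):
        self.name = name
        self.exons = exons  # all exons, UTR included
        self.cds = cds      # (cds_start, cds_end) in genomic coordinates

    def coding_exons(self) -> List[Interval]:
        """Exons clipped to the CDS range; pure UTR exons are dropped."""
        cs, ce = self.cds
        return [(max(s, cs), min(e, ce)) for s, e in self.exons if s <= ce and e >= cs]


def any_overlap(a: List[Interval], b: List[Interval]) -> bool:
    return any(s1 <= e2 and s2 <= e1 for s1, e1 in a for s2, e2 in b)


def cluster(transcripts: List[Transcript], cds_only: bool) -> List[List[str]]:
    """Group transcripts into genes by exon overlap (union-find, single linkage)."""
    parent = list(range(len(transcripts)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, t in enumerate(transcripts):
        for j in range(i):
            u = transcripts[j]
            a = t.coding_exons() if cds_only else t.exons
            b = u.coding_exons() if cds_only else u.exons
            if any_overlap(a, b):
                parent[find(i)] = find(j)

    groups: Dict[int, List[str]] = {}
    for i, t in enumerate(transcripts):
        groups.setdefault(find(i), []).append(t.name)
    return sorted(sorted(g) for g in groups.values())


# t1's 3' UTR runs into t2's first exon, but their coding exons do not overlap.
t1 = Transcript("t1", [(100, 200), (300, 400), (500, 700)], cds=(150, 550))
t2 = Transcript("t2", [(650, 800), (900, 1000)], cds=(700, 950))
print(cluster([t1, t2], cds_only=False))  # old rule
print(cluster([t1, t2], cds_only=True))   # new rule
```

In this example the old rule merges the two transcripts into one gene through the UTR overlap, while the new rule keeps them apart.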

Page 30: Ensembl Genome Annotation Overview

Status

Human and mouse gene builds were the main focus for 2007
• They have been the driver for a significant amount of development work
• Resulting builds are the best we have created so far
• We’ve produced patches to further improve the current sets

Problems which remain
• Sequence data which causes problems for prediction
  – ‘Chimeric’ clones
  – Fragmentary protein and cDNA sequences
  – Translated 3' UTR
• Gene clusters
  – CDS overlap as definition of gene leads to some cases that don't agree with manual definition of gene

Page 31: Ensembl Genome Annotation Overview


Val Curwen
Bronwen Aken
Julio Fernandez Banet
Laura Clarke
Amonida Zadissa
Felix Kokocinski
Jan-Hinnerk Vogel
Simon White

Sarah Dyer

Tim Hubbard
Ewan Birney

Michael Schuster

The rest of Ensembl

ISG (Systems support): Guy Coates

Havana: Jen Harrow, Laurens Wilming

NCBI: Kim Pruitt

UCSC: Mark Diekhans