The Vertebrate Genome Annotation Database

Vega and Community Manual Annotation. Jane Loveland, Havana group, Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK. Photo by Maj Britt Hansen. PAG XX, 15th January 2012

Upload: embl-ebi

Post on 10-May-2015


DESCRIPTION

Event: Plant and Animal Genomes conference
Speaker: Jane Loveland (Wellcome Trust Sanger Institute)

The Human and Vertebrate Analysis and Annotation (HAVANA) team at the Wellcome Trust Sanger Institute (WTSI) undertakes manual annotation of finished vertebrate genomic sequence. This annotation is available from the Vertebrate Genome Annotation database (VEGA) (http://vega.sanger.ac.uk) and is in progress for whole genomes (human, mouse and zebrafish) and for specific regions of interest in other species, such as pig (SLA and LRC), dog (DLA), wallaby (MHC) and gorilla (MHC and LRC).

Manual annotation of genomic data is extremely valuable for producing an accurate reference gene set, but it is expensive compared with automatic methods and so has been limited to model organisms. Annotation tools developed at the Wellcome Trust Sanger Institute are being used to fill that gap: they can be used remotely, which opens up viable community annotation collaborations. We introduce the “Blessed” annotator and “Gatekeeper” approaches to community annotation using the Otterlace/ZMap genome annotation tool. We also describe the strategies adopted for annotation consistency, quality control and viewing of the annotation.

We are currently involved in the annotation of immune response genes in pig and have annotated over 1700 genes in a community annotation effort, together with the sequencing and annotation of pig chromosomes X and Y at WTSI, which will be displayed on the VEGA website.

Presented on behalf of the HAVANA team, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambs, UK.

TRANSCRIPT

Page 1: The Vertebrate Genome Annotation Database

Vega and Community Manual Annotation

Jane Loveland Havana group, Wellcome Trust Sanger Institute,

Hinxton, Cambridge, UK

Photo by Maj Britt Hansen

PAG XX, 15th January 2012

Page 2: The Vertebrate Genome Annotation Database

Havana: Human and vertebrate analysis and annotation

•  Manual annotation of human, mouse and zebrafish whole chromosomes or genomes

•  Human ENCODE, mouse EUCOMM annotation

•  Annotation of specific regions: human MHC & LRC haplotypes, multiple species MHCs & LRCs

Vega: Vertebrate Genome Annotation

•  Ensembl-derived browser focusing on manual annotation

Page 3: The Vertebrate Genome Annotation Database

Overview

•  Manual annotation process: tools/pipeline/access of data (VEGA)

•  Community Manual Annotation –

Mouse (KOMP and NorCOMM)

Swine autosomes (IRAG)

Page 4: The Vertebrate Genome Annotation Database

Do we know how many genes there are?

Page 5: The Vertebrate Genome Annotation Database

Automatic Annotation vs Manual

Automatic Annotation

•  Quick whole genome analysis ~ weeks

•  Consistent annotation

•  Use unfinished/Illumina sequence/shotgun assembly

•  No polyA sites/signals, pseudogenes

•  Predicts ~75% loci

Manual Annotation

•  Extremely slow: ~3 months for Chr 6

•  Need finished (high quality) seq

•  Flexible, can deal with inconsistencies in data

•  Most rules have exceptions

•  Consult publications as well as databases

Page 6: The Vertebrate Genome Annotation Database

Manual annotation:

•  manual annotation of genomic sequence (finished and unfinished)

•  every exon of every transcript supported by homology (mRNA / EST / protein)

•  splice variants

•  pseudogenes

•  nomenclature

•  gene clusters

•  interpretation of problematic evidence

•  examination of literature

Page 7: The Vertebrate Genome Annotation Database

Analysis and Annotation pipeline: Otterlace/ZMap

[Figure: pipeline diagram. Labels: BLAST, gene predictions, RepeatMasker, CpG prediction, Pfam, RefSeq, Ensembl. DAS = Distributed Annotation System]

Page 8: The Vertebrate Genome Annotation Database

Annotation interface: datasets

Page 9: The Vertebrate Genome Annotation Database

Ana_notes: interface to record annotation history

Page 10: The Vertebrate Genome Annotation Database

[Figure with panels A to E. Labels: homology matches (EST, vertebrate mRNA); manual annotation (in red and green)]

Page 11: The Vertebrate Genome Annotation Database

Splicing checked via viewing cDNA alignments in “blixem”

Page 12: The Vertebrate Genome Annotation Database

Dotter can be used to align against unmasked sequence (reveal small exons)

Page 13: The Vertebrate Genome Annotation Database

DAS (Distributed Annotation System) source visible in ZMap

Page 14: The Vertebrate Genome Annotation Database

Search Pfam on the fly in otterlace

C21orf88 no pfamA domains

Page 15: The Vertebrate Genome Annotation Database

Manual Annotation and Biotypes:

Annotation based on transcriptional evidence.

[Figure: genomic sequence with aligned evidence. Labels: cDNAs, ESTs, protein, polyA tail (AAAAA)]

Protein Coding: Known_CDS, Novel_CDS, Putative_CDS, Nonsense_mediated_decay

Transcript: Non_coding, LincRNA, Antisense, Sense intronic, Sense overlapping, 3’ overlapping ncRNA, Retained_intron, Putative, Artefact

Pseudogene: Processed_pseudogene, Unprocessed_pseudogene, Transcribed_processed, Transcribed_unprocessed, Unitary_pseudogene, Polymorphic_pseudogene
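The three biotype groups on this slide can be held in a small lookup table. The following is only a sketch: the biotype names are copied from the slide, the grouping assumes the slide's three columns are Protein Coding, Transcript and Pseudogene, and the helper names (`BIOTYPES`, `biotype_class`) are hypothetical, not part of Otterlace or Vega.

```python
# Biotype names as listed on the slide, grouped by the three column headings.
# The grouping is an assumption read off the flattened slide text.
BIOTYPES = {
    "Protein coding": [
        "Known_CDS", "Novel_CDS", "Putative_CDS", "Nonsense_mediated_decay",
    ],
    "Transcript": [
        "Non_coding", "LincRNA", "Antisense", "Sense intronic",
        "Sense overlapping", "3' overlapping ncRNA", "Retained_intron",
        "Putative", "Artefact",
    ],
    "Pseudogene": [
        "Processed_pseudogene", "Unprocessed_pseudogene",
        "Transcribed_processed", "Transcribed_unprocessed",
        "Unitary_pseudogene", "Polymorphic_pseudogene",
    ],
}


def _normalise(name):
    """Lower-case a biotype name and unify spaces/underscores."""
    return name.strip().lower().replace(" ", "_")


# Reverse index: normalised biotype name -> class heading.
CLASS_OF = {
    _normalise(biotype): cls
    for cls, names in BIOTYPES.items()
    for biotype in names
}


def biotype_class(name):
    """Return the class heading for a biotype, or None if it is not listed."""
    return CLASS_OF.get(_normalise(name))
```

For example, `biotype_class("Retained_intron")` returns `"Transcript"`, and an unlisted name returns `None`.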

Page 16: The Vertebrate Genome Annotation Database

Identification of NMD:

TIBs Vol 33:8
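The slide cites only a reference and does not state the rule used. A commonly used heuristic is the "50-nucleotide rule": a transcript is a candidate for nonsense-mediated decay when its stop codon lies more than about 50 nt upstream of the last exon-exon junction. The sketch below assumes that rule; the function name, the coordinate convention and the default threshold are illustrative choices, not HAVANA's implementation.

```python
def is_nmd_candidate(exon_lengths, stop_codon_end, threshold=50):
    """Apply the 50-nt rule to one spliced transcript.

    exon_lengths   -- exon lengths in 5'->3' transcript order
    stop_codon_end -- 1-based transcript coordinate of the last base
                      of the stop codon
    threshold      -- minimum distance (nt) upstream of the last
                      exon-exon junction (assumed value: 50)
    """
    if len(exon_lengths) < 2:
        # A single-exon transcript has no exon-exon junction.
        return False
    # Transcript coordinate of the last base before the final junction.
    last_junction = sum(exon_lengths[:-1])
    return last_junction - stop_codon_end > threshold
```

With exons of 200, 150 and 300 nt the last junction sits at position 350, so a stop codon ending at 250 is flagged, while one ending at 320 or anywhere in the last exon is not.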

Page 17: The Vertebrate Genome Annotation Database

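Transcript biotypes: Schematic of lncRNA

[Figure: schematic of lncRNA transcripts around two coding genes, GENE 1 (Coding) and GENE 2 (Coding), with sense strand (5’) and antisense strand (5’) marked. Labelled transcripts: GENE1-IT1, GENE1-AS1, GENE2-AS1, GENE2-AS2, GENE2-AS3, GENE2-UA1, GENE2-OT1, GENE2-GENE1-AS1, lincRNA. Scale labels: 5 kb, 30 kb]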

Page 18: The Vertebrate Genome Annotation Database

Pseudogene Loci:

Unprocessed: duplication

Processed: reverse transcription and re-integration

[Figure: schematic of both routes, with polyA tails (AAAAAA) and positions marked by asterisks]

Page 19: The Vertebrate Genome Annotation Database

HAVANA Pseudogene Loci examples:

Unprocessed Processed

C20orf45 pseudogene

LILR family pseudogene

Pseudo_polyA signal

Page 20: The Vertebrate Genome Annotation Database

Transcribed processed pseudogene: functional? (OTTHUMT00000130640)

Processed pseudogene Poly-A annotation

EST evidence - some 100%

Protein evidence - contains stop codons

Pseudogene prediction
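One of the clues listed on this slide is that the protein evidence contains stop codons. A minimal way to check a candidate coding sequence for in-frame stop codons before its final codon is sketched below; the function name and the choice of reading frame 0 are assumptions for illustration, and a real pseudogene call rests on the other evidence on the slide as well.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def premature_stops(cds):
    """Return 0-based codon indices of in-frame stops before the last codon.

    cds -- nucleotide sequence read in frame 0; trailing bases that do not
           fill a complete codon are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    # The final codon is allowed to be a normal terminating stop.
    return [i for i, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]
```

For example, `premature_stops("ATGAAATAGCCCTAA")` reports the `TAG` at codon index 2, whereas `premature_stops("ATGAAACCCTAA")` returns an empty list.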

Page 21: The Vertebrate Genome Annotation Database

Gene Structure - 5’ End

DNASE1L1

All variants share ATG

Page 22: The Vertebrate Genome Annotation Database

Gene Structure - 3’ End

polyA sites and signals

RBM12

Page 23: The Vertebrate Genome Annotation Database

Vega: Portal for the data

Page 24: The Vertebrate Genome Annotation Database
Page 25: The Vertebrate Genome Annotation Database

Locus summary:

CCDS

Annotation date

xrefs

Page 26: The Vertebrate Genome Annotation Database

Evidence used to build transcripts

Page 27: The Vertebrate Genome Annotation Database

Linked loci ATP5O

DONSON

CRYZL1 ~300kb

Page 28: The Vertebrate Genome Annotation Database

HOXA gene cluster Human chr 7p15.2 Mouse chr 6qB3

HIT18844

HOXA11AS

HOTAIRM1

Long non-coding transcripts are conserved across species and regulate expression of HOX genes

Page 29: The Vertebrate Genome Annotation Database

Readthroughs/fusion proteins:

TNFSF12

TNFSF13

TNFSF12-TNFSF13

Page 30: The Vertebrate Genome Annotation Database

Human haplotypes in VEGA: MHC:

Reference (PGF) 6-COX 6-QBL 6-SSTO 6-APD 6-DBB 6-MANN 6-MCF

LRC: 19-COX 19-PGF_1

Other species MHC:

Page 31: The Vertebrate Genome Annotation Database

Multicontigview: Compare regions in MHC between pig and human

Page 32: The Vertebrate Genome Annotation Database

Community Annotation:

• Otterlace/ZMap software available for Mac and Linux

• Part of IKMC with EUCOMM annotation in mouse:

  • KOMP and NorCOMM annotation

• Jamborees for species with strong community interest:

  • Xenopus tropicalis 2005 (cDNA)

  • Cow 2007 (Genomic WGS) Publication

  • Pig

    • 2008 (Genomic WGS)

    • 2010 - 2012

      • IR genes in Pig (~2000 genes) manually annotated by community

      • Chromosomes X and Y to be fully finished and annotated by Havana

Page 33: The Vertebrate Genome Annotation Database

Community Annotation Approaches: The value of a genome is only as good as its annotation

Otterlace/ZMap Annotation Software: Anacode team

Authentication:

• Sanger single sign-on account (email)

• Registered email for otterlace permitted users: access to our data and analysis pipeline

“Blessed Annotator”: Mouse KOMP and NorCOMM (part of IKMC)

• External annotators (3) trained from Wash U and U Manitoba

• Identifying critical exons to make knock-out constructs

• 6 months of training and QC; integrated into mouse gene build

“Gatekeeper”: Swine autosomes

• External annotators worldwide (~30)

• Short training for European and US groups, plus regular calls and WebEx

• Guidance and QC by WTSI

Page 34: The Vertebrate Genome Annotation Database

Mouse KOMP annotation

[Figure: Dnhd1, panels A and B. Labels: Ensembl 64 prediction, repeats, knock-out construct, critical exon]

Page 35: The Vertebrate Genome Annotation Database

View KO’s in Vega

Page 36: The Vertebrate Genome Annotation Database

Swine Immune Response Annotation Group (IRAG)

Chris Tuggle (Iowa State), Claire Rogel-Gaillard (INRA)

USA: Iowa State, USDA, Michigan State, Univ Minnesota, Oklahoma State, Kansas State

China: Huazhong Agricultural University

Europe: INRA, Parco Tecnologico Padano, Roslin, UCL, WTSI

Japan: AFFRC STAFF

~30 annotators!

Page 37: The Vertebrate Genome Annotation Database

Unordered contigs in pig

Fragmented manual Annotation of CRISPLD2 (in red and green).

Grey vertical bars represent unordered contigs.

Vertebrate mRNA homology matches

Page 38: The Vertebrate Genome Annotation Database

Gene duplications in pig

[Figure: REG3A gene, panels A and B. Labels: manual annotations (in green and red), including several splice variants; Ensembl prediction track (in red); DNA and protein evidence; manual annotation (DAS)]

Page 39: The Vertebrate Genome Annotation Database

Missing genes from pig? Ensembl multi-species view

Family of glycoproteins, related to class 1 MHC. Activate natural killer T cells. Katherine Mann

Page 40: The Vertebrate Genome Annotation Database

CD1D

CD1B

CD1 family pseudogene

Manual annotation:

Expansions in cow and dog, but not pig?

Page 41: The Vertebrate Genome Annotation Database

Comparative annotation: SEPT6

Pig Human Mouse

Denise Carvalho-Silva

Page 42: The Vertebrate Genome Annotation Database

Community Annotation: Summary

“Blessed Annotator”:

• Extended training means less QC

• Wide range of biotypes

• Annotation figures: KOMP 1876 genes, NorCOMM 378 genes

“Gatekeeper”:

• Shorter training means more QC

• Annotation figures: Pig IRAG 1276 genes

Page 43: The Vertebrate Genome Annotation Database

Lessons Learned: QC

How to maintain quality with diverse annotation expertise

Training: Essential, plus regular updates (WebEx)

Nomenclature: Swine Nomenclature Committee

What next: Merge the IRAG manual annotation with the automated Ensembl annotation:

~ 8% of genome

Extend annotation / collaboration:

Refined gene list for IRAG

QC: Complex and partial genes

Publications

Page 44: The Vertebrate Genome Annotation Database

Acknowledgements

Havana: Jen Harrow, If Barnes, Ruth Bennett, Alex Bignell, Veronika Boychenko, Gloria Despacio-Reyes, Sarah Donaldson, Adam Frankish, Matthew Hardy, Toby Hunt, Mike Kay, Gavin Laird, David Lloyd, Jane Loveland, Deepa Manthravadi, Gaurab Mukherjee, Jonathan Mudge, Jeena Rajan, Liam Redgrave, Gary Saunders, Catherine Snow, Charles Steward, Marie-Marthe Suner, Mark Thomas, Laurens Wilming

Anacode: James Gilbert, Matthew Astley, Michael Gray, Jeremy Henty

Vega: Stephen Trevanion, Maurice Hendrix

ZMap: Ed Griffiths, Gemma Barson, Malcolm Hinsley

Mouse annotation:

KOMP: Amy Horton

NorCOMM: Molly Pind

http://vega.sanger.ac.uk

IRAG annotators:

USA: Jim Reecy, Chris Tuggle, Daniel Berman, Frank Blecha, Ryan Chen, Celine Chen, Daniel Ciobanu, Harry Dawson, Cathy Ernst, Zhiliang Hu, Joan Lunney, Katherine Mann, Michael Murtaugh, Yongming Sang, John Schwartz

China: Shuhong Zhao

Japan: Hirohide Uenishi, Takeya Morozumi, Hiroki Shinkai, Daisuke Toki

Europe: Alan Archibald, Claire Rogel-Gaillard, Anna Anselmo, Bouabid Badaoui, Bertrand Bed’Hom, Dario Beraldi, Lynsey Fairbairn, Elisabetta Giuffra, David Hume, Ronan Kapetanovic, Dennis Prickett, Christelle Robert, Yasu Takeuchi

Page 45: The Vertebrate Genome Annotation Database