Chinnappa Kodira April 2004 GMOD 2004, Cambridge, MA Manual Annotation of Human Genome at Broad Institute


Page 1: Chinnappa Kodira April 2004  GMOD 2004, Cambridge, MA

Chinnappa Kodira

April 2004 GMOD 2004, Cambridge, MA

Manual Annotation of Human Genome at Broad Institute

Page 2

Goals

Accurate and comprehensive catalog of genes and gene products

Robust annotation system for annotation of all sequenced genomes

Page 3

Annotation Strategy: Evidence-based Annotation

CSMD1 gene:

Gene Size: 2,065,608 bases

Transcript Length: 11,297 bases

Protein Length: 3,565 aa

No. of Exons: 68

Average Length of Exons: 166 bases

Fgenesh 20

Genscan 25

Blat_EST 179

mRNA 3
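The CSMD1 figures on this slide can be cross-checked with a little arithmetic. The sketch below takes the gene size as 2,065,608 bases and assumes, for illustration only, that the 68 exons of the transcript span the whole gene, so that everything outside the exons is intron. The variable names are hypothetical.

```python
# CSMD1 figures from the slide.
gene_size = 2_065_608        # bases
transcript_length = 11_297   # bases
exon_count = 68

# Average exon length; the slide reports 166 bases.
avg_exon_length = transcript_length / exon_count

# Share of the locus that is exonic.
exonic_fraction = transcript_length / gene_size

# Assumption: the transcript spans the full gene, so the remainder is
# split across (exon_count - 1) introns.
avg_intron_length = (gene_size - transcript_length) / (exon_count - 1)

print(round(avg_exon_length))        # 166
print(f"{exonic_fraction:.2%}")      # 0.55%
print(round(avg_intron_length))      # 30661
```

With well under 1% of the locus in exons, ab initio prediction alone has little signal to work with, which is one way to read the slide's emphasis on evidence.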

Page 4

Rule-based Annotation

Evidence types, in decreasing order of confidence level:

FL-mRNA

Species-specific ESTs

Cross-species ESTs

Protein homology

Ecores + Gene Predictions
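The confidence ordering on this slide can be read as a simple priority list. A minimal sketch, assuming hypothetical string labels for the evidence types and a hypothetical helper `best_evidence`; neither is taken from the Broad pipeline.

```python
# Evidence tiers from the slide, highest confidence first.
# The labels and the helper below are illustrative assumptions.
EVIDENCE_TIERS = [
    "FL-mRNA",
    "species-specific EST",
    "cross-species EST",
    "protein homology",
    "ecores + gene predictions",
]
RANK = {name: i for i, name in enumerate(EVIDENCE_TIERS)}


def best_evidence(evidence_types):
    """Return the highest-confidence evidence type present, or None."""
    known = [e for e in evidence_types if e in RANK]
    return min(known, key=RANK.__getitem__) if known else None


print(best_evidence(["protein homology", "cross-species EST"]))
# cross-species EST
```

A lookup table keeps the rule explicit and easy to reorder if the ranking changes.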

Page 5

Annotation System

Automated GeneCaller

Publication

database

Loader

Genome Evidence

TranscriptHunter

Manual Annotation

Argo Genome Browser

Alignment

QA

Page 6

Critical Steps in our Annotation Process

Running Computes

Selection and Filtering Evidence

Intelligent Automated Gene Caller

Genome Browser and Editor

Annotation Rules

Trained Manual Annotators

Annotation QA Process

Page 7

Computes

Finished Sequence

Repeat Mask

Homology Search

Sequence Alignment

Gene Prediction

Computed Features

Filtering of High Quality Evidence

•Identity >95% and >50% QS coverage

•Splice Junctions

•Rank Order

•Repeat filtering

Annotation

Raw Features
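The identity and coverage thresholds on this slide can be sketched as a small filter. This is an illustration only: it assumes "QS coverage" means the fraction of the query sequence covered by the alignment, and the `Alignment` record and `passes_filter` helper are hypothetical. The splice-junction, rank-order, and repeat criteria are not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    name: str
    percent_identity: float   # 0-100
    query_length: int         # bases in the query (EST or mRNA)
    aligned_query_bases: int  # query bases covered by the alignment


def passes_filter(aln, min_identity=95.0, min_coverage=0.50):
    """Apply the slide's thresholds: identity >95% and >50% query coverage."""
    coverage = aln.aligned_query_bases / aln.query_length
    return aln.percent_identity > min_identity and coverage > min_coverage


alignments = [
    Alignment("est_a", 98.2, 600, 540),  # passes both thresholds
    Alignment("est_b", 93.0, 600, 590),  # identity too low
    Alignment("est_c", 99.0, 600, 250),  # coverage too low
]
kept = [a.name for a in alignments if passes_filter(a)]
print(kept)  # ['est_a']
```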

Page 8

TranscriptHunter

Computed Features

Exon-based Clustering

•Define Gene Locus

Intron Edge Clustering

•Identify Variants

TranscriptHunter

Creation of Gene Models

•ORF and UTRs

•Gene Name

•Transcript Classification

•Curation Flags
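The two clustering steps on this slide could look roughly like the sketch below. It is not TranscriptHunter's actual algorithm: it assumes all transcripts lie on the same strand of one sequence, uses inclusive exon coordinates, and the function names are hypothetical. Transcripts that share any exon overlap are placed in one locus; within a locus, transcripts with different intron edges are treated as distinct variants.

```python
from collections import defaultdict


def exons_overlap(a, b):
    """True if any exon of transcript a overlaps any exon of transcript b."""
    return any(s1 <= e2 and s2 <= e1 for s1, e1 in a for s2, e2 in b)


def cluster_by_exons(transcripts):
    """Single-linkage clustering: transcripts sharing exon overlap form a locus."""
    loci = []  # each locus is a sorted list of transcript names
    for name, exons in transcripts.items():
        hits = [locus for locus in loci
                if any(exons_overlap(exons, transcripts[m]) for m in locus)]
        merged = [name]
        for locus in hits:
            merged.extend(locus)
            loci.remove(locus)
        loci.append(sorted(merged))
    return sorted(loci)


def intron_chain(exons):
    """Intron edges (exon end, next exon start) implied by consecutive exons."""
    exons = sorted(exons)
    return tuple((exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1))


def variants_in_locus(locus, transcripts):
    """Group the transcripts of one locus by identical intron chain."""
    groups = defaultdict(list)
    for name in locus:
        groups[intron_chain(transcripts[name])].append(name)
    return sorted(groups.values())


transcripts = {
    "t1": [(100, 200), (300, 400), (500, 600)],
    "t2": [(150, 200), (300, 400), (500, 650)],  # same introns as t1
    "t3": [(100, 200), (500, 600)],              # skips the middle exon
    "t4": [(900, 1000), (1100, 1200)],           # separate locus
}
loci = cluster_by_exons(transcripts)
print(loci)                                     # [['t1', 't2', 't3'], ['t4']]
print(variants_in_locus(loci[0], transcripts))  # [['t1', 't2'], ['t3']]
```

Comparing intron edges rather than exon boundaries means that transcripts differing only in their 5' or 3' ends, such as t1 and t2, collapse into a single variant.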

Page 9

Screening of spliced ESTs contained within repeat elements

AluYb8 Repeat

Spliced ESTs
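One way to express this screen in code is shown below. It is a sketch under stated assumptions: a spliced EST is represented as two or more aligned blocks, "contained within a repeat" is taken to mean that the whole alignment span falls inside a single annotated repeat interval, and the function names are hypothetical.

```python
def is_spliced(blocks):
    """An alignment with more than one block implies at least one intron."""
    return len(blocks) > 1


def within_single_repeat(blocks, repeats):
    """True if the full alignment span lies inside one repeat interval."""
    span_start = min(start for start, _ in blocks)
    span_end = max(end for _, end in blocks)
    return any(r_start <= span_start and span_end <= r_end
               for r_start, r_end in repeats)


def flag_repeat_contained(ests, repeats):
    """Names of spliced ESTs whose alignment sits entirely within a repeat."""
    return [name for name, blocks in ests.items()
            if is_spliced(blocks) and within_single_repeat(blocks, repeats)]


repeats = [(1000, 1310)]  # e.g. one Alu-sized element
ests = {
    "est1": [(1010, 1100), (1200, 1290)],  # spliced, inside the repeat
    "est2": [(900, 1100), (1500, 1600)],   # spliced, extends beyond it
    "est3": [(1020, 1280)],                # unspliced
}
flagged = flag_repeat_contained(ests, repeats)
print(flagged)  # ['est1']
```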

Page 10

Manual annotation

TranscriptHunter Gene Models

•Refine Gene Boundaries

•Exon/Intron

•3’ and 5’ UTR

•Create New Genes

•Classify Transcripts

•Edit Automated Gene Calls

•Identify Pseudogenes

•Add Curation Flags

•Call/Adjust ORF

•Select PolyA Signals

AnnotDB
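As an illustration of the "Call/Adjust ORF" step, the sketch below finds the longest ATG-to-stop open reading frame on the forward strand of a transcript sequence. It is a generic textbook approach, not the method used in Argo or TranscriptHunter, and `longest_orf` is a hypothetical helper.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end) of the longest ATG...stop ORF, or None.

    Coordinates are 0-based, end-exclusive, and include the stop codon.
    Only the three forward-strand frames are scanned.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG after the previous stop in this frame
            elif start is not None and codon in STOP_CODONS:
                if best is None or (i + 3 - start) > (best[1] - best[0]):
                    best = (start, i + 3)
                start = None
    return best


seq = "CCATGAAATTTGGGTAACC"
orf = longest_orf(seq)
print(orf)                  # (2, 17)
print(seq[orf[0]:orf[1]])   # ATGAAATTTGGGTAA
```

Everything upstream of the ORF start would then be 5' UTR and everything downstream of its end 3' UTR, which is where a manual annotator's adjustment comes in.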

Page 11

Features of Argo

Attaching primary and supplemental evidence

Cluster feature display

Filtering and customizing evidence list

Display poly A signals and splice junctions

Alerting discrepancies before updating

Highlighting parent and child features

Real-time interactive analysis

ORF selection options

Tabular dump of selected features

Roll back and save work

Customization of feature display

Page 12

Annotation View

Page 13

Confidence levels of our gene models

Classification of transcripts (Hawk standards): Known, Novel_CDS, Novel, Putative, Pseudogene

Association of primary and supplemental evidence with annotated feature

Rank order in selection of supporting evidence

Curation flags

Free text comments

Page 14

Gene counts for Broad and Ensembl

                   Ensembl       Broad
chrom  genome (%)  known  novel  known  novel+putative  Spl count  pseudogene
8             4.7    710    132    724             587        2.6         298
15            2.7    581    165    589             556        2.8         213
17            2.6   1120    167   1134             578        3.3         264
18            2.5    265     73    289             275        2.1         167
TOTAL        12.5   2676    537   2736            1996                    942
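The TOTAL row can be checked by summing the per-chromosome counts. The sketch below assumes the column reading in which the first known/novel pair belongs to Ensembl and the known/novel+putative pair to Broad, and it further assumes the categories within each set are disjoint so that they can be added.

```python
# Per-chromosome counts transcribed from the table on this slide.
# Column order: Ensembl known, Ensembl novel, Broad known,
# Broad novel+putative, pseudogene.
counts = {
    "8":  (710, 132, 724, 587, 298),
    "15": (581, 165, 589, 556, 213),
    "17": (1120, 167, 1134, 578, 264),
    "18": (265, 73, 289, 275, 167),
}

# Sum each column across the four chromosomes.
totals = tuple(sum(col) for col in zip(*counts.values()))
print(totals)  # (2676, 537, 2736, 1996, 942)

# Assumption: categories are disjoint within each annotation set.
ensembl_total = totals[0] + totals[1]
broad_total = totals[2] + totals[3]
print(ensembl_total, broad_total)  # 3213 4732
```

On this reading the known counts are close between the two sets, and most of the difference comes from the novel and putative categories.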

Page 15

Manually Annotated Gene Models vs. public Gene Models

Broad

MGC

Refseq

ENSEMBL

Gene-wise

mRNA

Page 16

Types of splice variation

Type % of variants

extra 31

skip 18

alt site 33

run on 18

CDS altered 84 %

new stop 48 %

Page 17

Our data extend most RefSeq/MGC transcripts

Distribution of extensions relative to RefSeq or MGC evidence (human chromosomes 8, 15, 17, 18)

Figure: % of extensions (0 to 80) versus length of extension (100 to 1000 bp), plotted separately for 5' and 3' extensions.

38 % positive for 5' extension

71 % positive for 3' extension

30 % positive for both

79 % positive for either

median 5' extension = 46 bases

median 3' extension = 143 bases
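The four percentages on this slide are consistent with each other by inclusion-exclusion: transcripts extended at either end are those extended at the 5' end plus those extended at the 3' end, minus those counted twice. A short check, with hypothetical variable names:

```python
# Percentages from the slide.
pct_5prime = 38
pct_3prime = 71
pct_both = 30

# Inclusion-exclusion: either = 5' + 3' - both.
pct_either = pct_5prime + pct_3prime - pct_both
print(pct_either)  # 79, matching the slide

# Transcripts extended at one end only.
pct_only_5prime = pct_5prime - pct_both
pct_only_3prime = pct_3prime - pct_both
print(pct_only_5prime, pct_only_3prime)  # 8 41
```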

Page 18

Complete 3’ end as compared to RefSeq mRNA and ENSEMBL gene

Page 19

How valid are these 3’ and 5’ extensions?

Figure: bar chart (scale 0 to 1) comparing Broad and ENSEMBL.

          PolyA signals   5 ^ATG…STOP$
Broad          86%            1.16%
ENSEMBL        68%           10.89%
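The two measures on this slide could be computed along the lines below. This is one possible reading, not the author's stated method: it assumes "PolyA signals" means a canonical AATAAA or ATTAAA hexamer near the 3' end of the transcript, and that "^ATG…STOP$" describes a transcript that begins at the start codon and ends at the stop codon, with no annotated UTRs. The 40-base window and the function names are assumptions.

```python
import re

# Transcript that is nothing but a coding sequence: starts with ATG,
# continues in whole codons, and ends on a stop codon.
# Internal in-frame stops are not excluded by this simple pattern.
CDS_ONLY = re.compile(r"^ATG(?:[ACGT]{3})*(?:TAA|TAG|TGA)$")

# Canonical polyadenylation signal and its most common variant.
POLYA_SIGNALS = ("AATAAA", "ATTAAA")


def is_cds_only(transcript):
    """True if the transcript has no sequence outside ATG...stop."""
    return bool(CDS_ONLY.match(transcript.upper()))


def has_polya_signal(transcript, window=40):
    """True if a polyA signal hexamer occurs in the last `window` bases."""
    tail = transcript.upper()[-window:]
    return any(signal in tail for signal in POLYA_SIGNALS)


with_utrs = "GG" + "ATGAAATAA" + "C" * 20 + "AATAAA" + "C" * 15
print(is_cds_only("ATGAAATAA"))     # True
print(is_cds_only(with_utrs))       # False
print(has_polya_signal(with_utrs))  # True
```

Under this reading, a high polyA-signal rate and a low CDS-only rate would both suggest that annotated transcript ends reach the true 3' and 5' boundaries.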

Page 20

Using Start and Stop Codon Context to Refine Annotation

Figure: Location of stop codons on exons. % stop codons (0 to 100) by exon order (n, n-1, n-2, n-3, n-4, n-5).

•Pseudogenes

•Real Stop codons

•NMD candidates

•Sequence Errors

•Non-coding genes

•SECIS genes

Figure: Location of start codons on exons. % start codons (0 to 70) by exon order (1, 2, 3, 4, 5, 6).

•Pseudogenes

•Real Start codons

•NMD candidates

•Sequence Errors

•Non-coding genes
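The exon position of a stop codon, as plotted on this slide, can be derived from exon lengths and the stop position in transcript coordinates. The sketch below also flags NMD candidates using the commonly cited rule of thumb that a stop codon more than about 50 nucleotides upstream of the last exon-exon junction may trigger nonsense-mediated decay. That threshold is general background, not something stated on the slide, and the function names are hypothetical.

```python
from itertools import accumulate


def exon_index(exon_lengths, pos):
    """0-based index of the exon containing transcript position `pos`."""
    for i, exon_end in enumerate(accumulate(exon_lengths)):
        if pos < exon_end:
            return i
    raise ValueError("position lies beyond the transcript end")


def exons_from_last(exon_lengths, stop_pos):
    """0 means the last exon (n), 1 means n-1, and so on."""
    return len(exon_lengths) - 1 - exon_index(exon_lengths, stop_pos)


def is_nmd_candidate(exon_lengths, stop_end, min_distance=50):
    """True if the stop codon ends more than `min_distance` bases
    upstream of the last exon-exon junction (assumed rule of thumb)."""
    if len(exon_lengths) < 2:
        return False
    last_junction = sum(exon_lengths[:-1])
    return last_junction - stop_end > min_distance


exon_lengths = [120, 200, 150, 300]  # transcript order, 5' to 3'
print(exons_from_last(exon_lengths, 250))   # 2, i.e. exon n-2
print(is_nmd_candidate(exon_lengths, 253))  # True
print(exons_from_last(exon_lengths, 600))   # 0, i.e. last exon
print(is_nmd_candidate(exon_lengths, 603))  # False
```

A stop codon that lands several exons before the end is then a prompt for the annotator to decide among the explanations listed on the slide.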

Page 21

Issues with Novel and putative transcripts

Concerns

•High number

•Low depth EST coverage

•Small transcript size

•Low no of variants

•Poor coding potential

•Poor cross-species conservation

•Low poly A frequency

•Weak CpG context

Probable reasons

•Spurious transcription

•Mostly partial

•Temporal genes

•Non-coding

•Poorly expressed

•Lineage specific

Page 22

Figure: transcripts labeled Putative, Novel, and Known.

Page 23

Annotating Non-coding mRNAs is still a challenge!

snoRNAs

Page 24

Challenges Ahead…

Establishing Common Standards

Validating Novel Transcripts

Single Exon Expressed Sequences

Determination of Accurate ORFs

Annotation of Functionally Relevant Alternative Splice Forms

Finding Sparsely Expressed Genes

Annotation of New Types of Non-coding Functional mRNAs

Incremental Update of Annotation

Capturing Biological Exceptions

Page 25

Acknowledgements

•Reinhard Engels

•Shunguang Wang

•Seth Purcell

•Tim Elkins

•Yuhong Wu

•Serge Smirnov

•Sarah Calvo

•David Dicaprio

Annotation and Analysis

•Charlie Whittaker

•Mark Borowsky

•Sinead O’Leary

•James Galagan

•Jill Mesirov

•Eric Lander

•Sequencing, Finishing and Closure Teams

Annotation Pipeline

Page 26

Comparison of alternative splice forms between ENSEMBL and Broad annotation

Broad

ENSEMBL

Refseq

dbEST

nrnt-mRNA

Manually Annotated Gene Models vs. public Gene Models

Page 27

ENSEMBL

GENEWISE

REFSEQ

TranscriptHunter

MANUAL ANNOTATION

ESTs

PolyA signal

Novel Transcript Variants of Known Genes